Publications

(last update: Apr. 2025)

Information Extraction

Adam Remaki, Jacques Ung, Pierre Pages, Perceval Wajsbürt, Elise Liu, Guillaume Faure, Thomas Petit-Jean, Xavier Tannier, Christel Gérardin.
Improving Phenotyping of Patients With Immune-Mediated Inflammatory Diseases Through Automated Processing of Discharge Summaries: Multicenter Cohort Study.
JMIR Medical Informatics. 13, April 2025. doi: 10.2196/68704
[abstract] [BibTeX] [JMIR link]
Background: Valuable insights gathered by clinicians during their inquiries and documented in textual reports are often unavailable in the structured data recorded in electronic health records (EHRs).
Objective: This study aimed to highlight that mining unstructured textual data with natural language processing techniques complements the available structured data and enables more comprehensive patient phenotyping. A proof-of-concept for patients diagnosed with specific autoimmune diseases is presented, in which the extraction of information on laboratory tests and drug treatments is performed.
Methods: We collected EHRs available in the clinical data warehouse of the Greater Paris University Hospitals from 2012 to 2021 for patients hospitalized and diagnosed with 1 of 4 immune-mediated inflammatory diseases: systemic lupus erythematosus, systemic sclerosis, antiphospholipid syndrome, and Takayasu arteritis. Then, we built, trained, and validated natural language processing algorithms on 103 discharge summaries selected from the cohort and annotated by a clinician. Finally, all discharge summaries in the cohort were processed with the algorithms, and the extracted data on laboratory tests and drug treatments were compared with the structured data.
Results: Named entity recognition followed by normalization yielded F1-scores of 71.1 (95% CI 63.6-77.8) for the laboratory tests and 89.3 (95% CI 85.9-91.6) for the drugs. Application of the algorithms to 18,604 EHRs increased the detection of antibody results and drug treatments. For instance, among patients in the systemic lupus erythematosus cohort with positive antinuclear antibodies, the rate increased from 18.34% (752/4102) to 71.87% (2949/4102), making the results more consistent with the literature.
Conclusions: While challenges remain in standardizing laboratory tests, particularly with abbreviations, this work, based on secondary use of clinical data, demonstrates that automated processing of discharge summaries enriched the information available in structured data and facilitated more comprehensive patient profiling.
@Article{Remaki2025, 
  title = {{Improving Phenotyping of Patients With Immune-Mediated Inflammatory Diseases Through Automated Processing of Discharge Summaries: Multicenter Cohort Study}},
  author = {Remaki, Adam and Ung, Jacques and Pages, Pierre and Wajsbürt, Perceval and Liu, Elise and Faure, Guillaume and Petit-Jean, Thomas and Tannier, Xavier and Gérardin, Christel},
  year = {2025}, 
  month = apr, 
  journal = {JMIR Medical Informatics}, 
  volume = {13}, 
  doi = {10.2196/68704}
}
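
As a reading aid for the "named entity recognition followed by normalization" step mentioned in the abstract above, here is a minimal Python sketch of terminology-based normalization. The tiny dictionary, the fuzzy-matching rule and the concept identifiers are illustrative assumptions, not the authors' implementation.

import difflib

TERMINOLOGY = {                      # normalized mention -> concept identifier
    "antinuclear antibodies": "ANA",
    "anti-dna antibodies": "anti-DNA",
}

def normalize(mention, threshold=0.8):
    """Map an extracted mention to its closest terminology entry, if any."""
    matches = difflib.get_close_matches(mention.lower(), TERMINOLOGY,
                                        n=1, cutoff=threshold)
    return TERMINOLOGY[matches[0]] if matches else None

print(normalize("Antinuclear antibody"))  # -> 'ANA' (fuzzy match)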
Chi-en Amy Tai, Xavier Tannier.
Clinical trial cohort selection using Large Language Models on n2c2 Challenges.
January 2025.
[abstract] [BibTeX] [arXiv]
Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is a time-consuming process that often requires manual review of patient text records for specific keywords. Though there have been studies on standardizing the information across the various platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have gained popularity for various NLP tasks due to their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and leverage the n2c2 challenges to benchmark their performance. Our results are promising with regard to the incorporation of LLMs for simple cohort selection tasks, but also highlight the difficulties encountered by these models as soon as fine-grained knowledge and reasoning are required.
@Misc{Tai2025, 
  title = {{Clinical trial cohort selection using Large Language Models on n2c2 Challenges}},
  author = {Tai, Chi-en Amy and Tannier, Xavier},
  year = {2025}, 
  month = jan, 
  note = {arXiv}
}
Marco Naguib, Xavier Tannier, Aurélie Névéol.
Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting.
in Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA, November 2024. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
Large language models (LLMs) have become the preferred solution for many natural language processing tasks. In low-resource environments such as specialized domains, their few-shot capabilities are expected to deliver high performance. Named Entity Recognition (NER) is a critical task in information extraction that is not covered in recent LLM benchmarks. There is a need to better understand the performance of LLMs for NER in a variety of settings, including languages other than English. This study aims to evaluate generative LLMs, employed through prompt engineering, for few-shot clinical NER. We compare 13 auto-regressive models using prompting and 16 masked models using fine-tuning on 14 NER datasets covering English, French and Spanish. While prompt-based auto-regressive models achieve competitive F1 scores for general NER, they are outperformed within the clinical domain by lighter biLSTM-CRF taggers based on masked models. Additionally, masked models exhibit a lower environmental impact than auto-regressive models. Findings are consistent across the three languages studied, which suggests that LLM prompting is not yet suited for NER production in the clinical domain.
@InProceedings{Naguib2024b, 
  title = {{Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting}},
  author = {Naguib, Marco and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024}, 
  address = {Miami, Florida, USA}, 
  year = {2024}, 
  month = nov, 
  publisher = {Association for Computational Linguistics}
}
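
To make the "prompt-based auto-regressive" setting above concrete, here is a toy Python sketch that assembles a few-shot NER prompt for a generative model. The prompt format, labels and examples are invented for illustration; the actual prompts, label sets and models evaluated in the paper are described there.

FEW_SHOT_EXAMPLES = [
    ("Patient started on metformin for type 2 diabetes.",
     {"DRUG": ["metformin"], "DISORDER": ["type 2 diabetes"]}),
    ("MRI showed no evidence of stroke.",
     {"DISORDER": ["stroke"]}),
]

def build_prompt(sentence, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking a generative LLM to list entities."""
    parts = ["Extract DRUG and DISORDER entities from the sentence."]
    for text, entities in examples:
        tagged = "; ".join(f"{label}: {', '.join(spans)}"
                           for label, spans in entities.items())
        parts.append(f"Sentence: {text}\nEntities: {tagged}")
    parts.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(parts)

print(build_prompt("He was prescribed amoxicillin for pneumonia."))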
Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol, Xavier Tannier, Christian Lovis.
Prompt Engineering Paradigms for Medical Applications: Scoping Review.
Journal of Medical Internet Research. September 2024. doi: 10.2196/60501
[abstract] [BibTeX] [JMIR Link]
Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored.
Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice.
Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering–based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD).
Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, the PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering–specific items of information reported across papers and find that many studies neglect to mention them explicitly, posing a challenge for advancing prompt engineering research.
Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.
@Article{Zaghir2024, 
  title = {{Prompt Engineering Paradigms for Medical Applications: Scoping Review}},
  author = {Zaghir, Jamil and Naguib, Marco and Bjelogrlic, Mina and Névéol, Aurélie and Tannier, Xavier and Lovis, Christian},
  year = {2024}, 
  month = sep, 
  journal = {Journal of Medical Internet Research}, 
  doi = {10.2196/60501}
}
Ariel Cohen, Alexandrine Lanson, Emmanuelle Kempf, Xavier Tannier.
Leveraging Information Redundancy of Real-World Data through Distant Supervision.
in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, pages 10352–10364, May 2024. © ELRA and ICCL.
[abstract] [BibTeX] [free copy]
We explore the task of event extraction and classification by harnessing the power of distant supervision. We present a novel text labeling method that leverages the redundancy of temporal information in a data lake. This method enables the creation of a large programmatically annotated corpus, allowing the training of transformer models using distant supervision. This aims to reduce expert annotation time, a scarce and expensive resource. Our approach utilizes temporal redundancy between structured sources and text, enabling the design of a replicable framework applicable to diverse real-world databases and use cases. We employ this method to create multiple silver datasets to reconstruct key events in cancer patients’ pathways, using clinical notes from a cohort of 380,000 oncological patients. By employing various noise label management techniques, we validate our end-to-end approach and compare it with a baseline classifier built on expert-annotated data. The implications of our work extend to accelerating downstream applications, such as patient recruitment for clinical trials, treatment effectiveness studies, survival analysis, and epidemiology research. While our study showcases the potential of the method, there remain avenues for further exploration, including advanced noise management techniques, semi-supervised approaches, and a deeper understanding of biases in the generated datasets and models.
@InProceedings{Cohen2024, 
  title = {{Leveraging Information Redundancy of Real-World Data through Distant Supervision}},
  author = {Cohen, Ariel and Lanson, Alexandrine and Kempf, Emmanuelle and Tannier, Xavier},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, 
  address = {Torino, Italia}, 
  year = {2024}, 
  month = may, 
  publisher = {ELRA and ICCL}, 
  pages = {10352–10364}
}
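
A minimal, self-contained Python sketch of the distant-supervision idea described in the abstract above: dates recorded in the structured database are matched against date mentions in clinical notes to produce "silver" labeled examples. The data, date format and matching rule are simplified assumptions, not the authors' implementation.

import re
from datetime import date

structured_events = {   # (patient_id, date) -> event type, from the database
    ("p1", date(2021, 3, 14)): "surgery",
}

notes = [
    ("p1", "Tumor resection was performed on 14/03/2021 without complication."),
]

DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")  # dd/mm/yyyy mentions

silver = []
for patient_id, text in notes:
    for m in DATE_RE.finditer(text):
        d = date(int(m.group(3)), int(m.group(2)), int(m.group(1)))
        label = structured_events.get((patient_id, d))
        if label:  # temporal redundancy: the note mentions a known event date
            silver.append({"text": text, "span": m.span(), "label": label})

print(silver)  # programmatically annotated examples for model training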
Nesrine Bannour, Christophe Servan, Aurélie Névéol, Xavier Tannier.
A Benchmark Evaluation of Clinical Named Entity Recognition in French.
in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, pages 14-21, May 2024. © ELRA and ICCL.
[abstract] [BibTeX] [free copy]
Background: Transformer-based language models have shown strong performance on many Natural Language Processing (NLP) tasks. Masked Language Models (MLMs) attract sustained interest because they can be adapted to different languages and sub-domains through training or fine-tuning on specific corpora while remaining lighter than modern Large Language Models (LLMs). Recently, several MLMs have been released for the biomedical domain in French, and experiments suggest that they outperform standard French counterparts. However, no systematic evaluation comparing all models on the same corpora is available. Objective: This paper presents an evaluation of masked language models for biomedical French on the task of clinical named entity recognition. Material and methods: We evaluate biomedical models CamemBERT-bio and DrBERT and compare them to standard French models CamemBERT, FlauBERT and FrAlBERT as well as multilingual mBERT using three publicly available corpora for clinical named entity recognition in French. The evaluation set-up relies on gold-standard corpora as released by the corpus developers. Results: Results suggest that CamemBERT-bio outperforms DrBERT consistently while FlauBERT offers competitive performance and FrAlBERT achieves the lowest carbon footprint. Conclusion: This is the first benchmark evaluation of biomedical masked language models for French clinical entity recognition that compares model performance consistently on nested entity recognition using metrics covering performance and environmental impact.
@InProceedings{Bannour2024, 
  title = {{A Benchmark Evaluation of Clinical Named Entity Recognition in French}},
  author = {Bannour, Nesrine and Servan, Christophe and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, 
  address = {Torino, Italia}, 
  year = {2024}, 
  month = may, 
  publisher = {ELRA and ICCL}, 
  pages = {14–21}
}
Christel Gérardin, Yuhan Xiong, Perceval Wajsbürt, Fabrice Carrat, Xavier Tannier.
Impact of translation on biomedical information extraction: an experiment on real-life clinical notes.
JMIR Medical Informatics. January 2024. doi: 10.2196/49607
[abstract] [BibTeX] [JMIR Link]
Background: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical datasets remains a challenge.
Objective: The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes.
Methods: We compare two methods: one involving French-language models and one involving English-language models. For the native French method, the Named Entity Recognition (NER) and normalization steps are performed separately. For the translated English method, after the first translation step, we compare a two-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English and bilingual annotated datasets to evaluate all stages (NER, normalization and translation) of our algorithms.
Results: The native French method outperformed the translated English method, with an overall F1-score of 0.51 [0.47; 0.55], compared with 0.39 [0.34; 0.44] and 0.38 [0.36; 0.40] for the two English methods tested.
Conclusions: Despite recent improvements in translation models, there is a significant difference in performance between the two approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.
@Article{Gerardin2024, 
  title = {{Impact of translation on biomedical information extraction: an experiment on real-life clinical notes}},
  author = {Gérardin, Christel and Xiong, Yuhan and Wajsbürt, Perceval and Carrat, Fabrice and Tannier, Xavier},
  year = {2024}, 
  month = jan, 
  journal = {JMIR Medical Informatics}, 
  doi = {10.2196/49607}
}
Thomas Petit-Jean, Christel Gérardin, Emmanuelle Berthelot, Gilles Chatellier, Marie Franck, Xavier Tannier, Emmanuelle Kempf, Romain Bey.
Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions.
Journal of the American Medical Informatics Association. Vol. 31, Issue 6, April 2024. doi: 10.1093/jamia/ocae069
[abstract] [BibTeX] [JAMIA Link]
Objective: To develop and validate a natural language processing (NLP) pipeline that detects 18 conditions in French clinical notes, including 16 comorbidities of the Charlson index, while exploring a collaborative and privacy-enhancing workflow.
Materials and Methods: The detection pipeline relied on rule-based and machine learning algorithms for named entity recognition and entity qualification, respectively. We used a large language model pre-trained on millions of clinical notes along with annotated clinical notes in the context of 3 cohort studies related to oncology, cardiology, and rheumatology. The overall workflow was conceived to foster collaboration between studies while respecting the privacy constraints of the data warehouse. We estimated the added values of the advanced technologies and of the collaborative setting.
Results: The pipeline reached macro-averaged F1-score, positive predictive value, sensitivity, and specificity of 95.7 (95%CI 94.5-96.3), 95.4 (95%CI 94.0-96.3), 96.0 (95%CI 94.0-96.7), and 99.2 (95%CI 99.0-99.4), respectively. F1-scores were superior to those observed using alternative technologies or non-collaborative settings. The models were shared through a secured registry.
Conclusions: We demonstrated that a community of investigators working on a common clinical data warehouse could efficiently and securely collaborate to develop, validate and use sensitive artificial intelligence models. In particular, we provided an efficient and robust NLP pipeline that detects conditions mentioned in clinical notes.
@Article{PetitJean2024, 
  title = {{Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions}},
  author = {Petit-Jean, Thomas and Gérardin, Christel and Berthelot, Emmanuelle and Chatellier, Gilles and Franck, Marie and Tannier, Xavier and Kempf, Emmanuelle and Bey, Romain},
  number = {6}, 
  year = {2024}, 
  month = apr, 
  journal = {Journal of the American Medical Informatics Association}, 
  volume = {31}, 
  doi = {10.1093/jamia/ocae069}
}
Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey.
Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse.
Methods of Information in Medicine. Vol. 63, Issue 01/02, March 2024. doi: 10.1055/s-0044-1778693
[abstract] [BibTeX] [Thieme Link]
Objective: The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP for Assistance Publique-Hôpitaux de Paris) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse.
Methods: We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system merging the results of a deep learning model with manual rules.
Results and Discussion: Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
@Article{Tannier2024, 
  title = {{Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse}},
  author = {Tannier, Xavier and Wajsbürt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  number = {01/02}, 
  year = {2024}, 
  month = mar, 
  journal = {Methods of Information in Medicine}, 
  volume = {63}, 
  doi = {10.1055/s-0044-1778693}
}
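
A minimal Python sketch of the hybrid strategy described above, merging spans found by hand-written rules with spans predicted by a learned model before masking. The rule patterns, the placeholder model and the replacement format are illustrative assumptions; the authors' actual rules, models and code are released separately under a 3-Clause BSD license.

import re

RULES = {
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),  # French phone numbers
}

def rule_entities(text):
    return [(m.start(), m.end(), label)
            for label, rx in RULES.items() for m in rx.finditer(text)]

def model_entities(text):
    return []  # placeholder for the predictions of a trained NER model

def pseudonymize(text):
    spans = sorted(set(rule_entities(text)) | set(model_entities(text)),
                   reverse=True)
    for start, end, label in spans:  # replace right-to-left to keep offsets valid
        text = text[:start] + f"<{label}>" + text[end:]
    return text

print(pseudonymize("Joignable au 06 12 34 56 78."))  # -> 'Joignable au <PHONE>.'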
Romain Bey, Ariel Cohen, Vincent Trebossen, Basile Dura, Pierre-Alexis Geoffroy, Charline Jean, Benjamin Landman, Thomas Petit-Jean, Gilles Chatellier, Kankoe Sallah, Xavier Tannier, Aurelie Bourmaud, Richard Delorme.
Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality.
npj Mental Health Research. Vol. 3, Issue 6, February 2024. doi: 10.1038/s44184-023-00046-7
[abstract] [BibTeX] [Nature Link]
There is an urgent need to monitor the mental health of large populations, especially during crises such as the COVID-19 pandemic, to timely identify the most at-risk subgroups and to design targeted prevention campaigns. We therefore developed and validated surveillance indicators related to suicidality: the monthly number of hospitalisations caused by suicide attempts and the prevalence among them of five known risk factors. They were automatically computed analysing the electronic health records of fifteen university hospitals of the Paris area, France, using natural language processing algorithms based on artificial intelligence. We evaluated the relevance of these indicators conducting a retrospective cohort study. Considering 2,911,920 records contained in a common data warehouse, we tested for changes after the pandemic outbreak in the slope of the monthly number of suicide attempts by conducting an interrupted time-series analysis. We segmented the assessment time in two sub-periods: before (August 1, 2017, to February 29, 2020) and during (March 1, 2020, to June 30, 2022) the COVID-19 pandemic. We detected 14,023 hospitalisations caused by suicide attempts. Their monthly number accelerated after the COVID-19 outbreak with an estimated trend variation reaching 3.7 (95%CI 2.1–5.3), mainly driven by an increase among girls aged 8–17 (trend variation 1.8, 95%CI 1.2–2.5). After the pandemic outbreak, acts of domestic, physical and sexual violence were more often reported (prevalence ratios: 1.3, 95%CI 1.16–1.48; 1.3, 95%CI 1.10–1.64 and 1.7, 95%CI 1.48–1.98), fewer patients died (p = 0.007) and stays were shorter (p < 0.001). Our study demonstrates that textual clinical data collected in multiple hospitals can be jointly analysed to compute timely indicators describing mental health conditions of populations. Our findings also highlight the need to better take into account the violence imposed on women, especially at early ages and in the aftermath of the COVID-19 pandemic.
@Article{Bey2024, 
  title = {{Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality}},
  author = {Bey, Romain and Cohen, Ariel and Trebossen, Vincent and Dura, Basile and Geoffroy, Pierre-Alexis and Jean, Charline and Landman, Benjamin and Petit-Jean, Thomas and Chatellier, Gilles and Sallah, Kankoe and Tannier, Xavier and Bourmaud, Aurelie and Delorme, Richard},
  number = {6}, 
  year = {2024}, 
  month = feb, 
  journal = {npj Mental Health Research}, 
  volume = {3}, 
  doi = {10.1038/s44184-023-00046-7}
}
Marco Naguib, Aurélie Névéol, Xavier Tannier.
Reconnaissance d’entités cliniques en few-shot en trois langues.
in Actes de la 31ème conférence Traitement Automatique des Langues Naturelles (TALN 2024). Toulouse, France, July 2024.
[abstract] [BibTeX] [pdf]
Large language models are becoming the solution of choice for many natural language processing tasks, including in specialized domains where their few-shot capabilities are expected to deliver high performance in low-resource settings. However, our evaluation of 10 auto-regressive models and 16 masked models shows that, although auto-regressive models using prompts can compete on named entity recognition (NER) outside the clinical domain, they are outperformed in the clinical domain by lighter biLSTM-CRF taggers based on masked models. Moreover, masked models have a much lower environmental impact than auto-regressive models. These results, consistent across the three languages studied, suggest that few-shot learning models are not yet suited to NER production in the clinical domain, but could be used to speed up the creation of high-quality annotated data.
@InProceedings{Naguib2024, 
  title = {{Reconnaissance d’entités cliniques en few-shot en trois langues}},
  author = {Naguib, Marco and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 31ème conférence Traitement Automatique des Langues Naturelles (TALN 2024)}, 
  address = {Toulouse, France}, 
  year = {2024}, 
  month = jul
}
Solène Delourme, Adam Remaki, Christel Gérardin, Pascal Vaillant, Xavier Tannier, Brigitte Séroussi, Akram Redjdal.
LIMICS@DEFT'24 : Un mini-LLM peut-il tricher aux QCM de pharmacie en fouillant dans Wikipédia et NACHOS ?.
in Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2024. Toulouse, France, July 2024.
[abstract] [BibTeX] [HAL link]
This paper explores two approaches to answering the pharmacy multiple-choice questions (MCQs) of the DEFT 2024 challenge using language models (LLMs) trained on open data with fewer than 3 billion parameters. Both approaches rely on a Retrieval-Augmented Generation (RAG) architecture to combine context retrieval from external knowledge bases (NACHOS and Wikipedia) with answer generation by the Apollo-2B LLM. The first approach processes the MCQs directly and generates the answers in a single step, while the second reformulates the MCQs as binary (yes/no) questions and then generates an answer for each binary question. The latter approach achieves an Exact Match Ratio of 14.7 and a Hamming Score of 51.6 on the test set, demonstrating the potential of RAG for question answering tasks under such constraints.
@InProceedings{Delourme2024, 
  title = {{LIMICS@DEFT'24 : Un mini-LLM peut-il tricher aux QCM de pharmacie en fouillant dans Wikipédia et NACHOS ?}},
  author = {Delourme, Solène and Remaki, Adam and Gérardin, Christel and Vaillant, Pascal and Tannier, Xavier and Séroussi, Brigitte and Redjdal, Akram},
  booktitle = {Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2024}, 
  address = {Toulouse, France}, 
  year = {2024}, 
  month = jul
}
Emmanuelle Kempf, Sonia Priou, Akram Redjdal, Étienne Guével, Xavier Tannier.
The More, the Better? Modalities of Metastatic Status Extraction on Free Medical Reports Based on Natural Language Processing (Response to Ahumada et al on Methodological and Practical Aspects of a Distant Metastasis Detection Model).
JCO Clinical Cancer Informatics. 8, August 2024. doi: 10.1200/CCI.24.00026
[BibTeX] [Ask me!] [Direct link]
@Article{Kempf2024b, 
  title = {{The More, the Better? Modalities of Metastatic Status Extraction on Free Medical Reports Based on Natural Language Processing (Response to Ahumada et al on Methodological and Practical Aspects of a Distant Metastasis Detection Model)}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Redjdal, Akram and Guével, Étienne and Tannier, Xavier},
  year = {2024}, 
  month = aug, 
  journal = {JCO Clinical Cancer Informatics}, 
  volume = {8}, 
  doi = {10.1200/CCI.24.00026}
}
Christel Gérardin, Adam Remaki, Jacques Ung, P Pagès, Perceval Wajsbürt, Guillaume Faure, Thomas Petit-Jean, Xavier Tannier.
Améliorer la caractérisation phénotypique des patients atteints de maladies inflammatoires à médiation immunitaire par l’analyse automatique des comptes-rendus hospitaliers.
in 89ème congrès français de médecine interne, Revue de Médecine Interne. March 2024.
[BibTeX] [ScienceDirect Link]
@InProceedings{Gerardin2024b, 
  title = {{Améliorer la caractérisation phénotypique des patients atteints de maladies inflammatoires à médiation immunitaire par l’analyse automatique des comptes-rendus hospitaliers}},
  author = {Gérardin, Christel and Remaki, Adam and Ung, Jacques and Pagès, P and Wajsbürt, Perceval and Faure, Guillaume and Petit-Jean, Thomas and Tannier, Xavier},
  booktitle = {89ème congrès français de médecine interne, Revue de Médecine Interne}, 
  year = {2024}, 
  month = mar
}
Emmanuelle Kempf, Sonia Priou, Basile Dura, Julien Calderaro, Clara Brones, Perceval Wajsbürt, Lina Bennani, Xavier Tannier.
Structuration des critères histopronostiques tumoraux par traitement automatique du langage naturel - Une comparaison entre apprentissage machine et règles.
in Congrès ÉMOIS, Special Issue of the Journal of Epidemiology and Population Health. March 2024.
[BibTeX] [ScienceDirect Link]
@InProceedings{Kempf2024, 
  title = {{Structuration des critères histopronostiques tumoraux par traitement automatique du langage naturel - Une comparaison entre apprentissage machine et règles}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Dura, Basile and Calderaro, Julien and Brones, Clara and Wajsbürt, Perceval and Bennani, Lina and Tannier, Xavier},
  booktitle = {Congrès ÉMOIS, Special Issue of the Journal of Epidemiology and Population Health}, 
  year = {2024}, 
  month = mar
}
Perceval Wajsburt, Xavier Tannier.
An end-to-end neural model based on cliques and scopes for frame extraction in long breast radiology reports.
in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Toronto, Canada, pages 156–170, July 2023. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
We consider the task of automatically extracting various overlapping frames, i.e., structured entities composed of multiple labels and mentions, from long clinical breast radiology documents. While many methods exist for related topics such as event extraction, slot filling, or discontinuous entity recognition, a challenge in our study resides in the fact that clinical reports typically contain overlapping frames that span multiple sentences or paragraphs. We propose a new method that addresses these difficulties and evaluate it on a new annotated corpus. Despite the small number of documents, we show that the hybridization between knowledge injection and a learning-based system allows us to quickly obtain proper results. We will also introduce the concept of scope relations and show that it both improves the performance of our system and provides a visual explanation of the predictions.
@InProceedings{Wajsburt2023, 
  title = {{An end-to-end neural model based on cliques and scopes for frame extraction in long breast radiology reports}},
  author = {Wajsburt, Perceval and Tannier, Xavier},
  booktitle = {The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks}, 
  address = {Toronto, Canada}, 
  year = {2023}, 
  month = jul, 
  publisher = {Association for Computational Linguistics}, 
  pages = {156–170}
}
Nesrine Bannour, Bastien Rance, Xavier Tannier, Aurelie Neveol.
Event-independent temporal positioning: application to French clinical text.
in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Toronto, Canada, pages 191–205, July 2023. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
Extracting temporal relations usually entails identifying and classifying the relation between two mentions. However, the definition of temporal mentions strongly depends on the text type and the application domain. Clinical text in particular is complex. It may describe events that occurred at different times, contain redundant information and a variety of domain-specific temporal expressions. In this paper, we propose a novel event-independent representation of temporal relations that is task-independent and, therefore, domain-independent. We are interested in identifying homogeneous text portions from a temporal standpoint and classifying the relation between each text portion and the document creation time. Temporal relation extraction is cast as a sequence labeling task and evaluated on oncology notes. We further evaluate our temporal representation by the temporal positioning of toxicity events of chemotherapy administered to colon and lung cancer patients described in French clinical reports. An overall macro F-measure of 0.86 is obtained for temporal relation extraction by a neural token classification model trained on clinical texts written in French. Our results suggest that the toxicity event extraction task can be performed successfully by automatically identifying toxicity events and placing them within the patient timeline (F-measure 0.62). The proposed system has the potential to assist clinicians in the preparation of tumor board meetings.
@InProceedings{Bannour2023b, 
  title = {{Event-independent temporal positioning: application to French clinical text}},
  author = {Bannour, Nesrine and Rance, Bastien and Tannier, Xavier and Neveol, Aurelie},
  booktitle = {The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks}, 
  address = {Toronto, Canada}, 
  year = {2023}, 
  month = jul, 
  publisher = {Association for Computational Linguistics}, 
  pages = {191–205}
}
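
To illustrate the representation proposed above, here is a small Python sketch that casts temporal positioning as sequence labeling: every token of a text portion receives the relation of that portion to the document creation time (DCT). The BEFORE/OVERLAP label set and the examples are simplified assumptions for illustration.

note_portions = [  # (homogeneous text portion, relation to the DCT)
    ("Chemotherapy was started in January .", "BEFORE"),
    ("Today the patient reports nausea .", "OVERLAP"),
]

# Expand portion-level relations into token-level tags, the format a
# neural token-classification model would be trained on.
for portion, relation in note_portions:
    for token in portion.split():
        print(f"{token}\t{relation}")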
Emmanuelle Kempf, Sonia Priou, Guillaume Lamé, Alexis Laurent, Etienne Guével, Stylianos Tzedakis, Romain Bey, David Fuks, Gilles Chatellier, Xavier Tannier, Gilles Galula, Rémi Flicoteaux, Christel Daniel, Christophe Tournigand.
No changes in clinical presentation, treatment strategies and survival of pancreatic cancer cases during the SARS-COV-2 outbreak: A retrospective multicenter cohort study on real-world data.
International Journal of Cancer. August 2023. doi: 10.1002/ijc.34675
[abstract] [BibTeX] [Ask me!] [Direct link (Wiley)]
The SARS-COV-2 pandemic disrupted healthcare systems. We assessed its impact on the presentation, care trajectories and outcomes of new pancreatic cancers (PCs) in the Paris area. We performed a retrospective multicenter cohort study on the data warehouse of Greater Paris University Hospitals (AP-HP). We identified all patients newly referred with a PC between January 1, 2019, and June 30, 2021, and excluded endocrine tumors. Using claims data and health records, we analyzed the timeline of care trajectories, the initial tumor stage, the treatment categories: pancreatectomy, exclusive systemic therapy or exclusive best supportive care (BSC). We calculated patients' 1-year overall survival (OS) and compared indicators in 2019 and 2020 to 2021. We included 2335 patients. Referral fell by 29% during the first lockdown. The median times from biopsy and from the first multidisciplinary meeting (MDM) to treatment were 25 days (16-50) and 21 days (11-40), respectively. Between 2019 and 2020 to 2021, the rate of metastatic tumors (36% vs 33%, P = .39), the pTNM distribution of the 464 cases with upfront tumor resection (P = .80), and the proportion of treatment categories did not vary: tumor resection (32% vs 33%), exclusive systemic therapy (49% vs 49%), exclusive BSC (19% vs 19%). The 1-year OS rates in 2019 vs 2020 to 2021 were 92% vs 89% (aHR = 1.42; 95% CI, 0.82-2.48), 52% vs 56% (aHR = 0.88; 95% CI, 0.73-1.08), 13% vs 10% (aHR = 1.00; 95% CI, 0.78-1.25), in the treatment categories, respectively. Despite an initial decrease in the number of new PCs, we did not observe any stage shift. OS did not vary significantly.
@Article{Kempf2023b, 
  title = {{No changes in clinical presentation, treatment strategies and survival of pancreatic cancer cases during the SARS-COV-2 outbreak: A retrospective multicenter cohort study on real-world data}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Lamé, Guillaume and Laurent, Alexis and Guével, Etienne and Tzedakis, Stylianos and Bey, Romain and Fuks, David and Chatellier, Gilles and Tannier, Xavier and Galula, Gilles and Flicoteaux, Rémi and Daniel, Christel and Tournigand, Christophe},
  year = {2023}, 
  month = aug, 
  journal = {International Journal of Cancer}, 
  doi = {10.1002/ijc.34675}
}
Emmanuelle Kempf, Morgan Vaterkowski, Damien Leprovost, Nicolas Griffon, David Ouagne, Stéphane Bréant, Patricia Serre, Alexandre Mouchet, Bastien Rance, Gilles Chatellier, Ali Bellamine, Marie Frank, Julien Guerin, Xavier Tannier, Alain Livartowski, Martin Hilka, Christel Daniel.
How to Improve Cancer Patients ENrollment in Clinical Trials From rEal-Life Databases Using the Observational Medical Outcomes Partnership Oncology Extension: Results of the PENELOPE Initiative in Urologic Cancers.
JCO Clinical Cancer Informatics. 7, May 2023. doi: 10.1200/CCI.22.00179
[abstract] [BibTeX] [Ask me!] [Direct link]
Purpose: To compare the computability of Observational Medical Outcomes Partnership (OMOP)-based queries related to prescreening of patients using two versions of the OMOP common data model (CDM; v5.3 and v5.4) and to assess the performance of the Greater Paris University Hospital (APHP) prescreening tool.
Materials and methods: We identified the prescreening information items being relevant for prescreening of patients with cancer. We randomly selected 15 academic and industry-sponsored urology phase I-IV clinical trials (CTs) launched at APHP between 2016 and 2021. The computability of the related prescreening criteria (PC) was defined by their translation rate in OMOP-compliant queries and by their execution rate on the APHP clinical data warehouse (CDW) containing data of 205,977 patients with cancer. The overall performance of the prescreening tool was assessed by the rate of true- and false-positive cases of three randomly selected CTs.
Results: We defined a list of 15 minimal information items being relevant for patients' prescreening. We identified 83 PC among the 534 eligibility criteria from the 15 CTs. We translated 33 and 62 PC into queries on the basis of OMOP CDM v5.3 and v5.4, respectively (translation rates of 40% and 75%, respectively). Of the 33 PC translated with v5.3 of the OMOP CDM, 19 could be executed on the APHP CDW (execution rate of 58%). Of 83 PC, the computability rate on the APHP CDW reached 23%. On the basis of three CTs, we identified 17, 32, and 63 patients as being potentially eligible for inclusion in those CTs, resulting in positive predictive values of 53%, 41%, and 21%, respectively.
Conclusion: We showed that PC could be formalized according to the OMOP CDM and that the oncology extension increased their translation rate through better representation of cancer natural history.
@Article{Kempf2023a, 
  title = {{How to Improve Cancer Patients ENrollment in Clinical Trials From rEal-Life Databases Using the Observational Medical Outcomes Partnership Oncology Extension: Results of the PENELOPE Initiative in Urologic Cancers}},
  author = {Kempf, Emmanuelle and Vaterkowski, Morgan and Leprovost, Damien and Griffon, Nicolas and Ouagne, David and Bréant, Stéphane and Serre, Patricia and Mouchet, Alexandre and Rance, Bastien and Chatellier, Gilles and Bellamine, Ali and Frank, Marie and Guerin, Julien and Tannier, Xavier and Livartowski, Alain and Hilka, Martin and Daniel, Christel},
  year = {2023}, 
  month = may, 
  journal = {JCO Clinical Cancer Informatics}, 
  volume = {7}, 
  doi = {10.1200/CCI.22.00179}
}
Marco Naguib, Aurélie Névéol, Xavier Tannier.
Stratégies d'apprentissage actif pour la reconnaissance d'entités nommées en français.
in Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023). Paris, France, June 2023.
[abstract] [BibTeX] [pdf]
Manual corpus annotation is a costly and slow process, especially for the named entity recognition task. Active learning aims to make this process more efficient by selecting the most relevant portions to annotate. Some strategies select the portions that are most representative of the corpus, others those that are most informative for the language model. Despite growing interest in active learning, few studies compare these strategies in the context of medical named entity recognition. We compare these strategies based on their performance on 3 corpora of clinical documents in French: MERLOT, QuaeroFrenchMed and E3C. We compare the selection strategies as well as the different ways of evaluating them. Finally, we identify the strategies that appear most effective and measure the improvement they bring at different phases of learning.
@InProceedings{Naguib2023, 
  title = {{Stratégies d'apprentissage actif pour la reconnaissance d'entités nommées en français}},
  author = {Naguib, Marco and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023)}, 
  address = {Paris, France}, 
  year = {2023}, 
  month = jun
}
Nesrine Bannour, Xavier Tannier, Bastien Rance, Aurélie Névéol.
Positionnement temporel indépendant des évènements : application à des textes cliniques en français.
in Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023). Paris, France, June 2023.
[abstract] [BibTeX] [pdf]
Temporal relation extraction consists in identifying and classifying the relation between two mentions. However, the definition of temporal mentions depends largely on the text type and the application domain. Clinical text in particular is complex because it describes events occurring at different times and contains redundant information and various domain-specific temporal expressions. In this paper, we propose a new representation of temporal relations that is independent of the domain and of the goal of the extraction task. We focus on extracting the relation between each portion of the text and the document creation date. We cast temporal relation extraction as a sequence labeling task. A macro F-measure of 0.8 is obtained by a neural model trained on clinical texts written in French. We evaluate our temporal representation through the temporal positioning of chemotherapy toxicity events.
@InProceedings{Bannour2023, 
  title = {{Positionnement temporel indépendant des évènements : application à des textes cliniques en français}},
  author = {Bannour, Nesrine and Tannier, Xavier and Rance, Bastien and Névéol, Aurélie},
  booktitle = {Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023)}, 
  address = {Paris, France}, 
  year = {2023}, 
  month = jun
}
Christel Gérardin, Arthur Mageau, Arsène Mékinian, Xavier Tannier, Fabrice Carrat.
Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study.
JMIR Medical Informatics. Vol. 10, Issue 12, December 2022. doi: 10.2196/42379
[abstract] [BibTeX] [JMIR link]
Background: Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical record databases remains a challenge, especially in a language other than English.
Objective: We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases.
Methods: Our multistep algorithm includes a named-entity recognition step, a multilabel classification step using the Medical Subject Headings ontology, and the computation of patient similarity. A selection of cohorts of similar patients based on a priori annotated phenotypes was performed. Six phenotypes were selected for their clinical significance: P1, osteoporosis; P2, nephritis in systemic erythematosus lupus; P3, interstitial lung disease in systemic sclerosis; P4, lung infection; P5, obstetric antiphospholipid syndrome; and P6, Takayasu arteritis. We used a training set of 151 clinical notes and an independent validation set of 256 clinical notes, with annotated phenotypes, both extracted from the Assistance Publique-Hôpitaux de Paris data warehouse. We evaluated the precision of the 3 patients closest to the index patient for each phenotype with precision-at-3, recall, and average precision.
Results: For P1-P4, the precision-at-3 ranged from 0.85 (95% CI 0.75-0.95) to 0.99 (95% CI 0.98-1), the recall ranged from 0.53 (95% CI 0.50-0.55) to 0.83 (95% CI 0.81-0.84), and the average precision ranged from 0.58 (95% CI 0.54-0.62) to 0.88 (95% CI 0.85-0.90). P5-P6 phenotypes could not be analyzed due to the limited number of phenotypes.
Conclusions: Using a method close to clinical reasoning, we built a scalable and interpretable end-to-end algorithm for extracting cohorts of similar patients.
@Article{Gérardin2022b, 
  title = {{Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study}},
  author = {Gérardin, Christel and Mageau, Arthur and Mékinian, Arsène and Tannier, Xavier and Carrat, Fabrice},
  number = {12}, 
  year = {2022}, 
  month = dec, 
  journal = {JMIR Medical Informatics}, 
  volume = {10}, 
  doi = {10.2196/42379}
}
Christel Gérardin, Perceval Wajsbürt, Pascal Vaillant, Ali Bellamine, Fabrice Carrat, Xavier Tannier.
Multilabel classification of medical concepts for patient clinical profile identification.
Artificial Intelligence in Medicine. 128, June 2022. doi: 10.1016/j.artmed.2022.102311
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • Extracting key information from clinical narratives is an NLP challenge.
  • There is a particular need to improve NLP tasks in languages other than English.
  • Our approach allows automatic pathological domains detection from clinical notes.
  • Using multilingual vocabularies and multilingual model leads to better results.
Abstract
Background: The development of electronic health records has provided a large volume of unstructured biomedical information. Extracting patient characteristics from these data has become a major challenge, especially in languages other than English.
Methods: Inspired by the French Text Mining Challenge (DEFT 2021) [1], in which we participated, our study proposes a multilabel classification of clinical narratives, allowing us to automatically extract the main features of a patient report. Our system is an end-to-end pipeline from raw text to labels with two main steps: named entity recognition and multilabel classification. Both steps rely on a transformer-based neural network architecture. To train our final classifier, we extended the dataset with all English and French Unified Medical Language System (UMLS) vocabularies related to human diseases. We focus our study on the multilingualism of training resources and models, with experiments combining French and English in different ways (multilingual embeddings or translation).
Results: We obtained an overall average micro-F1 score of 0.811 for the multilingual version, 0.807 for the French-only version and 0.797 for the translated version.
Conclusion: Our study proposes an original multilabel classification of French clinical notes for patient phenotyping. We show that a multilingual algorithm trained on annotated real clinical notes and UMLS vocabularies leads to the best results.
@Article{Gérardin2022, 
  title = {{Multilabel classification of medical concepts for patient clinical profile identification}},
  author = {Gérardin, Christel and Wajsbürt, Perceval and Vaillant, Pascal and Bellamine, Ali and Carrat, Fabrice and Tannier, Xavier},
  year = {2022}, 
  month = jun, 
  journal = {Artificial Intelligence in Medicine}, 
  volume = {128}, 
  doi = {10.1016/j.artmed.2022.102311}
}
Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, Aurélie Névéol.
Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French.
Journal of Biomedical Informatics. 130, June 2022. doi: 10.1016/j.jbi.2022.104073
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • We propose Privacy-Preserving Mimic Models for clinical named entity recognition.
  • Models are trained without processing any sensitive data or private model weights.
  • Mimic models achieve up to 0.706 macro exact F-measure on 15 clinical entity types.
  • Our approach offers a good compromise between performance and privacy preservation.
Abstract
A vast amount of crucial information about patients resides solely in unstructured clinical narrative notes. There has been a growing interest in the clinical Named Entity Recognition (NER) task using deep learning models. Such approaches require sufficient annotated data. However, there are few publicly available annotated corpora in the medical field due to the sensitive nature of clinical text. In this paper, we tackle this problem by building privacy-preserving shareable models for French clinical Named Entity Recognition using the mimic learning approach, which enables knowledge transfer from a teacher model trained on a private corpus to a student model. This student model can be publicly shared without any access to the original sensitive data. We evaluated three privacy-preserving models using three medical corpora and compared the performance of our models to those of baseline models such as dictionary-based models. An overall macro F-measure of 70.6% could be achieved by a student model trained using silver annotations produced by the teacher model, compared to 85.7% for the original private teacher model. Our results revealed that these privacy-preserving mimic learning models offer a good compromise between performance and data privacy preservation.
@Article{Bannour2022, 
  title = {{Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French}},
  author = {Bannour, Nesrine and Wajsbürt, Perceval and Rance, Bastien and Tannier, Xavier and Névéol, Aurélie},
  year = {2022}, 
  month = jun, 
  journal = {Journal of Biomedical Informatics}, 
  volume = {130}, 
  doi = {10.1016/j.jbi.2022.104073}
}
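
A minimal Python sketch of the teacher-student mimic learning setup described above: a teacher trained on private notes produces silver annotations for a public corpus, and only the student trained on that silver data is shared. The train_ner helper and the data are placeholders, not the authors' code.

def train_ner(texts, labels):
    """Placeholder: fit any NER model on (texts, labels) and return it."""
    class Model:
        def predict(self, texts):
            return [[] for _ in texts]  # dummy entity predictions
    return Model()

# 1. Teacher: trained inside the secure environment on sensitive data.
private_texts, private_labels = ["..."], [[]]
teacher = train_ner(private_texts, private_labels)

# 2. Silver annotation of a public, non-sensitive corpus.
public_texts = ["..."]
silver_labels = teacher.predict(public_texts)

# 3. Student: trained only on public texts and silver labels, hence shareable.
student = train_ner(public_texts, silver_labels)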
Adrian Ahne, Guy Fagherazzi, Xavier Tannier, Thomas Czernichow, Francisco Orchard.
Improving Diabetes-Related Biomedical Literature Exploration in the Clinical Decision-making Process via Interactive Classification and Topic Discovery: Methodology Development Study.
Journal of Medical Internet Research. Vol. 24, Issue 1, January 2022. doi: 10.2196/27434
[abstract] [BibTeX] [JMIR link]
Background: The amount of available textual health data such as scientific and biomedical literature is constantly growing, making it more and more challenging for health professionals to properly summarize those data and practice evidence-based clinical decision making. Moreover, the exploration of unstructured health text data is challenging for professionals without computer science knowledge due to limited time, resources, and skills. Current tools for exploring text data lack ease of use, require high computational effort, and struggle to incorporate domain knowledge and focus on topics of interest.
Objective: We developed a methodology able to explore and target topics of interest via an interactive user interface for health professionals with limited computer science knowledge. We aim to reach near state-of-the-art performance while reducing memory consumption, increasing scalability, and minimizing user interaction effort to improve the clinical decision-making process. The performance was evaluated on diabetes-related abstracts from PubMed.
Methods: The methodology consists of 4 parts: (1) a novel interpretable hierarchical clustering of documents where each node is defined by headwords (words that best represent the documents in the node), (2) an efficient classification system to target topics, (3) minimized user interaction effort through active learning, and (4) a visual user interface. We evaluated our approach on 50,911 diabetes-related abstracts providing a hierarchical Medical Subject Headings (MeSH) structure, a unique identifier for a topic. Hierarchical clustering performance was compared against the implementation in the machine learning library scikit-learn. On a subset of 2000 randomly chosen diabetes abstracts, our active learning strategy was compared against 3 other strategies: random selection of training instances, uncertainty sampling that chooses instances about which the model is most uncertain, and an expected gradient length strategy based on convolutional neural networks (CNNs).
Results: For the hierarchical clustering performance, we achieved an F1 score of 0.73 compared to 0.76 achieved by scikit-learn. Concerning active learning performance, after 200 chosen training samples based on these strategies, the weighted F1 score of all MeSH codes resulted in a satisfying 0.62 F1 score using our approach, 0.61 using the uncertainty strategy, 0.63 using the CNN, and 0.45 using the random strategy. Moreover, our methodology showed constant low memory use as the number of documents increased.
Conclusions: We proposed an easy-to-use tool for health professionals with limited computer science knowledge, who can combine their domain knowledge with topic exploration and target specific topics of interest while improving transparency. Furthermore, our approach is memory efficient and highly parallelizable, making it interesting for large big data sets. This approach can be used by health professionals to gain deep insights into biomedical literature and ultimately improve the evidence-based clinical decision-making process.
@Article{Ahne2022, 
  title = {{Improving Diabetes-Related Biomedical Literature Exploration in the Clinical Decision-making Process via Interactive Classification and Topic Discovery: Methodology Development Study}},
  author = {Ahne, Adrian and Fagherazzi, Guy and Tannier, Xavier and Czernichow, Thomas and Orchard, Francisco},
  number = {1}, 
  year = {2022}, 
  month = jan, 
  journal = {Journal of Medical Internet Research}, 
  volume = {24}, 
  doi = {10.2196/27434}
}
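The uncertainty-sampling baseline mentioned in the abstract above can be sketched in a few lines; this is a generic illustration on invented toy data, not the study's code:

# Sketch of least-confident uncertainty sampling: iteratively query the
# unlabeled abstract the classifier is least sure about, add it to the
# training pool, and retrain. Documents and labels are toy examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["insulin therapy in type 1 diabetes", "diet and type 2 diabetes",
        "retinopathy screening", "glucose monitoring devices",
        "exercise and glycemic control", "insulin pump safety"]
labels = np.array([0, 1, 0, 1, 1, 0])          # toy topic labels

X = TfidfVectorizer().fit_transform(docs)
labeled = [0, 1]                                # seed set with both classes
unlabeled = [i for i in range(len(docs)) if i not in labeled]

for _ in range(2):                              # two active-learning rounds
    clf = LogisticRegression().fit(X[labeled], labels[labeled])
    proba = clf.predict_proba(X[unlabeled])
    # Least-confident sampling: smallest top-class probability.
    pick = unlabeled[int(np.argmin(proba.max(axis=1)))]
    labeled.append(pick)                        # the oracle provides its label
    unlabeled.remove(pick)
    print("queried doc:", pick)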
Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, Aurélie Névéol.
Modèles préservant la confidentialité des données par mimétisme pour la reconnaissance d’entités nommées en français.
in Actes de la journée d’étude sur la robustesse des systèmes de TAL. Paris, France, December 2022.
[BibTeX] [Link to free copy]
@InProceedings{Bannour2022b, 
  title = {{Modèles préservant la confidentialité des données par mimétisme pour la reconnaissance d’entités nommées en français}},
  author = {Bannour, Nesrine and Wajsbürt, Perceval and Rance, Bastien and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Actes de la journée d’étude sur la robustesse des systèmes de TAL}, 
  address = {Paris, France}, 
  year = {2022}, 
  month = dec
}
Perceval Wajsbürt, Arnaud Sarfati, Xavier Tannier.
Medical concept normalization in French using multilingual terminologies and contextual embeddings.
Journal of Biomedical Informatics. 114, January 2021. doi: 10.1016/j.jbi.2021.103684
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • We train a model to normalize medical entities in French with a very large list of concepts.
  • Our method is a neural network model that requires no prior translation.
  • Multilingual training data improves the performance of medical normalization in French.
  • Multilingual embeddings are of less importance than multilingual data.
Introduction: Concept normalization is the task of linking terms from textual medical documents to their concept in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which poses a problem for languages other than English.
Objective: We present a system for concept normalization in French. We consider textual mentions already extracted and labeled by a named entity recognition system, and we classify these mentions with a UMLS concept unique identifier. We take advantage of the multilingual nature of available terminologies and embedding models to improve concept normalization in French without translation or direct supervision.
Materials and methods: We consider the task as a highly-multiclass classification problem. The terms are encoded with contextualized embeddings and classified via cosine similarity and softmax. A first step uses a subset of the terminology to finetune the embeddings and train the model. A second step adds the entire target terminology, and the model is trained further with hard negative selection and softmax sampling.
Results: On two corpora from the Quaero FrenchMed benchmark, we show that our approach can lead to good results even with no labeled data at all, and that it outperforms existing supervised methods when labeled data is available.
Discussion: Training the system with both French and English terms improves by a large margin the performance of the system on a French benchmark, regardless of the way the embeddings were pretrained (French, English, multilingual). Our distantly supervised method can be applied to any kind of documents or medical domain, as it does not require any concept-labeled documents.
Conclusion: These experiments pave the way for simpler and more effective multilingual approaches to processing medical texts in languages other than English.
@Article{Wajsburt2021, 
  title = {{Medical concept normalization in French using multilingual terminologies and contextual embeddings}},
  author = {Wajsbürt, Perceval and Sarfati, Arnaud and Tannier, Xavier},
  year = {2021}, 
  month = jan, 
  journal = {Journal of Biomedical Informatics}, 
  volume = {114}, 
  doi = {10.1016/j.jbi.2021.103684}
}
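A toy sketch of the similarity-based normalization idea described above, with TF-IDF character n-grams standing in for the contextual embeddings; the CUIs and terms are shown for illustration only:

# Mentions and terminology entries are embedded in the same space and a
# mention is linked to its nearest concept by cosine similarity. Here the
# "embeddings" are TF-IDF character n-grams, a stand-in for the paper's
# contextualized encoder; there is no fine-tuning or negative sampling.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

terminology = {
    "C0020538": "hypertension artérielle",
    "C0011849": "diabète sucré",
    "C0004096": "asthme",
}
cuis = list(terminology)
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
concept_vecs = vec.fit_transform(terminology.values())

def normalize(mention: str) -> str:
    """Return the CUI of the most similar concept name."""
    sims = cosine_similarity(vec.transform([mention]), concept_vecs)
    return cuis[int(np.argmax(sims))]

print(normalize("une hypertension"))   # expected: C0020538
print(normalize("diabète de type 2"))  # expected: C0011849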
Pierre Chastang, Sergio Torres Aguilar, Xavier Tannier.
A Named Entity Recognition Model for Medieval Latin Charters.
Digital Humanities Quarterly. Vol. 15, Issue 4, November 2021.
[abstract] [BibTeX] [DHQ free link]
Named entity recognition is an advantageous technique with an increasing presence in digital humanities. In theory, automatic detection and recovery of named entities can provide new ways of looking up unedited information in edited sources and can allow the parsing of massive amounts of data in a short time to support historical hypotheses. In this paper, we detail the implementation of a model for automatic named entity recognition in medieval Latin sources and test its robustness on different datasets. Different models were trained on a vast dataset of Burgundian diplomatic charters from the 9th to 14th centuries and validated using general and century-specific ad hoc models tested on short sets of Parisian, English, Italian and Spanish charters. We present the results of cross-validation in each case and discuss the implications of these results for the history of medieval place-names and personal names.
@Article{Chastang2021, 
  title = {{A Named Entity Recognition Model for Medieval Latin Charters}},
  author = {Chastang, Pierre and Torres Aguilar, Sergio and Tannier, Xavier},
  number = {4}, 
  year = {2021}, 
  month = nov, 
  journal = {Digital Humanities Quarterly}, 
  volume = {15}
}
Perceval Wajsbürt, Yoann Taillé, Xavier Tannier.
Effect of depth order on iterative nested named entity recognition models.
in Conference on Artificial Intelligence in Medicine (AIME 2021). Porto, Portugal, June 2021.
[abstract] [BibTeX] [Long version on arXiv]
This paper studies the effect of mention depth order on nested named entity recognition (NER) models. NER is an essential task in the extraction of biomedical information, and nested entities are common since medical concepts can assemble to form larger entities. Conventional NER systems only predict disjoint entities. Thus, iterative models for nested NER use multiple predictions to enumerate all entities, imposing a predefined order, from largest to smallest or smallest to largest. We design an order-agnostic iterative model and a procedure to choose a custom order during training and prediction. We propose a modification of the Transformer architecture to take into account the entities predicted in the previous steps. We provide a set of experiments to study the model's capabilities and the effect of the order on performance. Finally, we show that the smallest-to-largest order gives the best results.
@InProceedings{Wajsburt2021b, 
  title = {{Effect of depth order on iterative nested named entity recognition models}},
  author = {Perceval Wajsbürt and Yoann Taillé and Xavier Tannier},
  booktitle = {Conference on Artificial Intelligence in Medicine (AIME 2021)}, 
  address = {Porto, Portugal}, 
  year = {2021}, 
  month = jun
}
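Schematically, the iterative decoding studied above can be pictured as repeated passes over the sentence, each conditioned on the entities found so far; the rule-based "model pass" below is only a placeholder for the paper's modified Transformer, and all spans and labels are invented:

# Iterative nested-NER decoding, smallest-to-largest (the order found best
# in the paper): each pass sees the entities found at the previous depth.
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def predict_one_depth(tokens: List[str], found: List[Span]) -> List[Span]:
    """Stand-in for one model pass; a real model conditions on `found`."""
    new = []
    if not found:
        # Depth 1: the innermost mention.
        if "aorte" in tokens:
            i = tokens.index("aorte")
            new.append((i, i + 1, "ANATOMY"))
    else:
        # Depth 2: a larger entity wrapping a previously found one.
        if "dissection" in tokens:
            i = tokens.index("dissection")
            new.append((i, i + 3, "DISORDER"))  # covers "dissection de aorte"
    return new

tokens = "une dissection de aorte".split()
entities: List[Span] = []
for depth in range(2):
    step = predict_one_depth(tokens, entities)
    if not step:
        break
    entities.extend(step)
print(entities)  # [(3, 4, 'ANATOMY'), (1, 4, 'DISORDER')]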
Christel Gérardin, Pascal Vaillant, Perceval Wajsbürt, Clément Gilavert, Ali Bellamine, Emmanuelle Kempf, Xavier Tannier.
Classification multilabel de concepts médicaux pour l’identification du profil clinique du patient.
in Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2021. Lille, France, June 2021.
[abstract] [BibTeX] [HAL link]
The first task of the DEFT 2021 text-mining challenge consisted in automatically extracting, from clinical cases, the pathological phenotypes of patients, grouped by MeSH-disease chapter headings. The solution presented is a transformer-based multilabel classifier. Two transformers were used: the standard camembert-large (run 1) and a camembert-large fine-tuned on freely available French biomedical articles (run 2). We also proposed an "end-to-end" model, with a first named entity extraction phase, also based on a camembert-large transformer, and a gender classifier based on an Adaboost model. We obtain very good recall and decent precision, with an F1-measure around 0.77 for all three runs. The performance of the "end-to-end" model is similar to that of the other methods.
@InProceedings{Gerardin2021, 
  title = {{Classification multilabel de concepts médicaux pour l’identification du profil clinique du patient}},
  author = {Christel Gérardin and Pascal Vaillant and Perceval Wajsbürt and Clément Gilavert and Ali Bellamine and Emmanuelle Kempf and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2021}, 
  address = {Lille, France}, 
  year = {2021}, 
  month = jun
}
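A schematic multilabel setup in the spirit of the system above, with TF-IDF plus one-vs-rest logistic regression standing in for the camembert-large encoder; the texts and MeSH-style labels are invented:

# Multilabel classification of clinical cases: each case may carry several
# chapter-level labels at once, so a one-vs-rest scheme fits one binary
# classifier per label. A toy stand-in, not the challenge submission.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

cases = ["douleur thoracique et dyspnée",
         "éruption cutanée prurigineuse",
         "dyspnée avec toux chronique"]
labels = [{"cardiovascular", "respiratory"}, {"skin"}, {"respiratory"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # binary indicator matrix
vec = TfidfVectorizer()
clf = OneVsRestClassifier(LogisticRegression()).fit(
    vec.fit_transform(cases), Y)

pred = clf.predict(vec.transform(["toux et dyspnée aiguë"]))
print(mlb.inverse_transform(pred))            # predicted label sets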
Ali Bellamine, Christel Daniel, Perceval Wajsbürt, Christian Roux, Xavier Tannier, Karine Briot.
Identification automatique des patients avec fractures ostéoporotiques à partir de comptes rendus médicaux.
in 34e Congrès Français de Rhumatologie. Paris, France, December 2021.
[abstract] [BibTeX] [ScienceDirect]
Introduction: Osteoporotic fractures are associated with excess morbidity and mortality. Implementing fracture-liaison care pathways is effective in reducing the risk of new fractures and the excess mortality. The human resources required and the difficulty of identifying eligible patients are among the obstacles to setting up and running these pathways. The objective of this study is to develop and validate an automatic detection tool identifying osteoporotic fractures in patients over 50 years of age from medical reports.
Patients and methods: The automatic detection tool relies on a pipeline of algorithms combining natural language processing, machine learning and rule-based techniques. The tool was developed and validated on medical reports from the emergency and orthopedics departments of the clinical data warehouse (EDS) of the Assistance Publique–Hôpitaux de Paris (AP–HP). It was developed on a random sample of 4917 documents from one hospital. The documents used to design the algorithms are distinct from those used to train them. External validation was performed on all orthopedics and emergency medical reports collected over 3 months in the EDS, i.e., 154,031 documents. The performance of the tool (sensitivity Se, specificity Sp, positive predictive value PPV, negative predictive value NPV) was computed for both development and validation.
Results: The tool was developed on 3913 emergency documents and 1004 orthopedics documents. The performance of the various algorithms making up the tool is: Se between 80 and 93%, Sp between 62 and 99%, PPV between 90 and 96%, and NPV between 69 and 99%. The tool was validated on a base of 154,031 documents (148,423 from the emergency departments and 5608 from orthopedics) (46% women, mean age 67 years). It identified 4% of emergency documents with a fracture likely to be osteoporotic (n = 5806) and 27% of orthopedics documents (n = 1503), corresponding to a population with a mean age of 74 years and 68% women. Manual validation by an expert was performed on 1000 randomly selected documents with an identified fracture and 1000 without. Se, Sp, PPV and NPV are 68%, 100%, 78% and 99% for emergency reports, and 84%, 97%, 92% and 93% for orthopedics reports.
Conclusion: This study is the first work showing that an automatic identification tool based on natural language processing and machine learning can identify patients with fractures likely to be osteoporotic from emergency and orthopedics medical reports. The performance of the tool is good and meets the need for assistance in identifying patients within post-fracture care pathways.
@InProceedings{Bellamine2021, 
  title = {{Identification automatique des patients avec fractures ostéoporotiques à partir de comptes rendus médicaux}},
  author = {Ali Bellamine and Christel Daniel and Perceval Wajsbürt and Christian Roux and Xavier Tannier and Karine Briot},
  booktitle = {34e Congrès Français de Rhumatologie}, 
  address = {Paris, France}, 
  year = {2021}, 
  month = dec
}
Nesrine Bannour, Aurélie Névéol, Xavier Tannier, Bastien Rance.
Traitement Automatique de la Langue et Intégration de Données pour les Réunions de Concertations Pluridisciplinaires en Oncologie.
in Journée AFIA/ATALA "la santé et le langage". February 2021.
[abstract] [BibTeX] [Link]
Multidisciplinary team meetings (RCP) in oncology allow experts from different specialties to choose the best therapeutic options for patients. The data needed for these meetings are often collected manually, with a risk of error during extraction and a significant cost for healthcare professionals. Several scientific studies on English-language documents have addressed the automatic extraction of information (such as tumor location, histological classifications, TNM staging, etc.) from clinical reports in medical records. Within the ASIMOV project (ASsIster la recherche en oncologie par le Machine Learning, l'intégration de dOnnées et la Visualisation), we will use natural language processing and data integration to extract cancer-related information from data warehouses and French clinical texts.
@InProceedings{Bannour2021, 
  title = {{Traitement Automatique de la Langue et Intégration de Données pour les Réunions de Concertations Pluridisciplinaires en Oncologie}},
  author = {Bannour, Nesrine and Névéol, Aurélie and Tannier, Xavier and Rance, Bastien},
  booktitle = {Journée AFIA/ATALA "la santé et le langage"}, 
  year = {2021}, 
  month = feb
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques.
in Actes de la 27ème conférence Traitement Automatique des Langues Naturelles (TALN 2020). Nancy, France, June 2020.
[abstract] [BibTeX] [HAL link]
Coreference resolution is an essential component for the automatic construction of medical timelines from electronic health records. In this work, we present a neural approach to coreference resolution for general and clinical entities in medical texts written in English, and evaluate it on the reference benchmark for this task, Task 1C of the 2011 i2b2 campaign.
@InProceedings{Tourille2020, 
  title = {{Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 27ème conférence Traitement Automatique des Langues Naturelles (TALN 2020)}, 
  address = {Nancy, France}, 
  year = {2020}, 
  month = jun
}
Perceval Wajsbürt, Yoann Taillé, Guillaume Lainé, Xavier Tannier.
Participation de l'équipe du LIMICS à DEFT 2020.
in Défi Fouille de Texte (DEFT) 2020. Nancy, France, June 2020.
[abstract] [BibTeX] [HAL link]
In this article, we present the methods designed and the results obtained during our participation in Task 3 of the DEFT 2020 evaluation campaign, consisting in named entity recognition in the medical domain. We propose two different models that take nested entities into account, one of the difficulties of the proposed dataset, and present the results obtained. Our best run achieves the best performance among the participants on one of the two subtasks of the challenge.
@InProceedings{Wajsburt2020, 
  title = {{Participation de l'équipe du LIMICS à DEFT 2020}},
  author = {Perceval Wajsbürt and Yoann Taillé and Guillaume Lainé and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT) 2020}, 
  address = {Nancy, France}, 
  year = {2020}, 
  month = jun
}
Xavier Tannier, Nicolas Paris, Hugo Cisneros, Christel Daniel, Matthieu Doutreligne, Catherine Duclos, Nicolas Griffon, Claire Hassen-Khodja, Ivan Lerner, Adrien Parrot, Éric Sadou, Cyril Saussol, Pascal Vaillant.
Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials.
March 2019.
[abstract] [BibTeX] [arXiv]
Objective: Natural language processing can help minimize human intervention in identifying patients meeting eligibility criteria for clinical trials, but there is still a long way to go to obtain a general and systematic approach that is useful for researchers. We describe two methods taking a step in this direction and present their results obtained during the n2c2 challenge on cohort selection for clinical trials.
Materials and Methods: The first method is a weakly supervised method using an unlabeled corpus (MIMIC) to build a silver standard, by producing semi-automatically a small and very precise set of rules to detect some samples of positive and negative patients. This silver standard is then used to train a traditional supervised model. The second method is a terminology-based approach where a medical expert selects the appropriate concepts, and a procedure is defined to search the terms and check the structural or temporal constraints.
Results: On the n2c2 dataset containing annotated data about 13 selection criteria on 288 patients, we obtained an overall F1-measure of 0.8969, which is the third best result out of 45 participant teams, with no statistically significant difference with the best-ranked team.
Discussion: Both approaches obtained very encouraging results and apply to different types of criteria. The weakly supervised method requires explicit descriptions of positive and negative examples in some reports. The terminology-based method is very efficient when medical concepts carry most of the relevant information.
Conclusion: It is unlikely that much more annotated data will be soon available for the task of identifying a wide range of patient phenotypes. One must focus on weakly or non-supervised learning methods using both structured and unstructured data and relying on a comprehensive representation of the patients.
@Misc{Tannier2019, 
  title = {{Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials}},
  author = {Xavier Tannier and Nicolas Paris and Hugo Cisneros and Christel Daniel and Matthieu Doutreligne and Catherine Duclos and Nicolas Griffon and Claire Hassen-Khodja and Ivan Lerner and Adrien Parrot and Éric Sadou and Cyril Saussol and Pascal Vaillant},
  year = {2019}, 
  month = mar, 
  note = {arXiv}
}
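A condensed sketch of the weak-supervision recipe described in the Materials and Methods above: a handful of high-precision rules produce a silver standard that then trains an ordinary supervised classifier. The rules, texts and labels are invented for illustration:

# A few very precise rules label clearly positive/negative reports (the
# "silver standard"); the trained model then generalizes to reports the
# rules did not cover. Requires Python 3.8+ for the walrus operator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = [
    "patient drinks alcohol daily, heavy use reported",
    "patient denies any alcohol use",
    "no history of alcohol consumption",
    "chronic alcohol abuse noted in record",
    "social history unremarkable",
]

def silver_label(text: str):
    """High-precision rules; return None when no rule fires."""
    if "denies any alcohol" in text or "no history of alcohol" in text:
        return 0
    if "alcohol abuse" in text or "drinks alcohol daily" in text:
        return 1
    return None

silver = [(t, y) for t in reports if (y := silver_label(t)) is not None]
texts, ys = zip(*silver)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), ys)
# Score a report none of the rules matched.
print(clf.predict(vec.transform(["social drinker, alcohol on weekends"])))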
Charlotte Rudnik, Thibault Ehrhart, Olivier Ferret, Denis Teyssou, Raphaël Troncy, Xavier Tannier.
Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata.
in Proceedings of the Wiki Workshop 2019 (The Web Conference). San Francisco, USA, May 2019.
[abstract] [BibTeX] [arXiv]
News agencies produce thousands of multimedia stories describing events happening in the world, either scheduled events such as sports competitions, political summits and elections, or breaking events such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and compare with similar past events. However, searching for precise facts described in stories is hard. In this paper, we propose a general method that leverages the Wikidata knowledge base to produce semantic annotations of news articles. Next, we describe a semantic search engine that supports both keyword-based search in news articles and structured data search, providing filters for properties belonging to specific event schemas that are automatically inferred.
@InProceedings{Rudnik2019, 
  title = {{Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata}},
  author = {Rudnik, Charlotte and Ehrhart, Thibault and Ferret, Olivier and Teyssou, Denis and Troncy, Raphaël and Tannier, Xavier},
  booktitle = {Proceedings of the Wiki Workshop 2019 (The Web Conference)}, 
  address = {San Francisco, USA}, 
  year = {2019}, 
  month = may
}
Nicolas Paris, Matthieu Doutreligne, Adrien Parrot, Xavier Tannier.
Désidentification de comptes-rendus hospitaliers dans une base de données OMOP.
in Actes de TALMED 2019 : Symposium satellite francophone sur le traitement automatique des langues dans le domaine biomédical. Lyon, France, August 2019.
[abstract] [BibTeX] [paper]
In medicine, research on patient data aims to improve care. To preserve patient privacy, these data are usually de-identified. Textual documents contain a great deal of information found only in this material and are therefore of major interest for research; however, they also represent a technical challenge linked to the de-identification process. This work proposes a hybrid de-identification method evaluated on a sample of texts from the clinical data warehouse of the Assistance Publique des Hôpitaux de Paris. The two main contributions are de-identification performance above the state of the art for French, and the implementation of a freely available, standardized processing pipeline built on OMOP-CDM, a common representation model for medical data widely used worldwide.
@InProceedings{Paris2019, 
  title = {{Désidentification de comptes-rendus hospitaliers dans une base de données OMOP}},
  author = {Nicolas Paris and Matthieu Doutreligne and Adrien Parrot and Xavier Tannier},
  booktitle = {Actes de TALMED 2019 : Symposium satellite francophone sur le traitement automatique des langues dans le domaine biomédical}, 
  address = {Lyon, France}, 
  year = {2019}, 
  month = aug
}
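To give a flavour of the rule-based side of the hybrid de-identification pipeline above (the actual system also uses machine learning), here is a minimal, deliberately incomplete sketch with illustrative patterns:

# Replace a few categories of protected health information with tags.
# Patterns are toy examples, far from the coverage of a real system.
import re

PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "<DATE>"),
    (re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"), "<PHONE>"),
    (re.compile(r"\b(?:Dr|Pr|M\.|Mme)\s+[A-ZÉÈ][a-zéè]+\b"), "<NAME>"),
]

def deidentify(text: str) -> str:
    """Apply each pattern in turn, substituting its placeholder tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Vu par Dr Martin le 12/03/2019, rappeler au 01 42 34 56 78."
print(deidentify(note))
# -> "Vu par <NAME> le <DATE>, rappeler au <PHONE>."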
Jacques Hilbey, Louise Deléger, Xavier Tannier.
Participation de l’équipe LAI à DEFT 2019.
in Défi Fouille de Texte (DEFT) 2019. Toulouse, France, July 2019.
[abstract] [BibTeX] [paper]
We present in this article the methods developed and the results obtained during our participation in Task 3 of the DEFT 2019 evaluation campaign. We used simple rule-based or machine-learning approaches; our results are very good on information that is simple to extract (age, gender), but remain mixed on the more difficult tasks.
@InProceedings{Hilbey2019, 
  title = {{Participation de l’équipe LAI à DEFT 2019}},
  author = {Jacques Hilbey and Louise Deléger and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT) 2019}, 
  address = {Toulouse, France}, 
  year = {2019}, 
  month = jul
}
Julien Tourille, Matthieu Doutreligne, Olivier Ferret, Nicolas Paris, Aurélie Névéol, Xavier Tannier.
Evaluation of a Sequence Tagging Tool for Biomedical Texts.
in Proceedings of the EMNLP Workshop on Health Text Mining and Information Analysis (LOUHI 2018). Brussels, Belgium, October 2018.
[abstract] [BibTeX] [ACL Anthology]
Many applications in biomedical natural language processing rely on sequence tagging as an initial step to perform more complex analysis. To support text analysis in the biomedical domain, we introduce Yet Another SEquence Tagger (YASET), an open-source multi-purpose sequence tagger that implements state-of-the-art deep learning algorithms for sequence tagging. Herein, we evaluate YASET on part-of-speech tagging and named entity recognition in a variety of text genres including articles from the biomedical literature in English and clinical narratives in French. To further characterize performance, we report distributions over 30 runs and different sizes of training datasets. YASET provides state-of-the-art performance on the CoNLL 2003 NER dataset (F1=0.87), MEDPOST corpus (F1=0.97), MERLoT corpus (F1=0.99) and NCBI disease corpus (F1=0.81). We believe that YASET is a versatile and efficient tool that can be used for sequence tagging in biomedical and clinical texts.
@InProceedings{Tourille2018, 
  title = {{Evaluation of a Sequence Tagging Tool for Biomedical Texts}},
  author = {Julien Tourille and Matthieu Doutreligne and Olivier Ferret and Nicolas Paris and Aurélie Névéol and Xavier Tannier},
  booktitle = {Proceedings of the EMNLP Workshop on Health Text Mining and Information Analysis (LOUHI 2018)}, 
  address = {Brussels, Belgium}, 
  year = {2018}, 
  month = oct
}
Sylvie Cazalens, Philippe Lamarre, Julien Leblay, Ioana Manolescu, Xavier Tannier.
Computational fact-checking: a content management perspective.
Rio de Janeiro, Brazil, August 2018.
Tutorial presented at the conference VLDB.
[abstract] [BibTeX] [slides]
The tremendous value of Big Data has been noticed of late also by the media, and the term “data journalism” has been coined to refer to journalistic work inspired by digital data sources. A particularly popular and active area of data journalism is concerned with fact-checking. The term was born in the journalist community and referred to the process of verifying and ensuring the accuracy of published media content; since 2012, however, it has increasingly focused on the analysis of politics, economy, science, and news content shared in any form, but first and foremost on the Web (social and otherwise). These trends have been noticed by computer scientists working in industry and academia. Thus, a very lively area of digital content management research has taken up these problems and works to propose foundations (models) and algorithms, and to implement them through concrete tools. Our proposed tutorial:
  1. Outlines the current state of affairs in the area of digital (or computational) fact-checking in newsrooms, by journalists, NGO workers, scientists and IT companies;
  2. Shows which areas of digital content management research, in particular those relying on the Web, can be leveraged to help fact-checking, and gives a comprehensive survey of efforts in this area;
  3. Highlights ongoing trends, unsolved problems, and areas where we envision future scientific and practical advances.
@Misc{Cazalens2018b, 
  title = {{Computational fact-checking: a content management perspective}},
  author = {Sylvie Cazalens and Philippe Lamarre and Julien Leblay and Ioana Manolescu and Xavier Tannier},
  address = {Rio de Janeiro, Brazil}, 
  year = {2018}, 
  month = aug, 
  note = {Tutorial presented at the conference VLDB.}
}
Sylvie Cazalens, Philippe Lamarre, Julien Leblay, Ioana Manolescu, Xavier Tannier.
A Content Management Perspective on Fact-Checking.
in Proceedings of the Web Conference 2018. Lyon, France, April 2018.
[abstract] [BibTeX] [pdf] [html]
Fact checking has captured the attention of the media and the public alike; it has also recently received strong attention from the computer science community, in particular from data and knowledge management, natural language processing and information retrieval; we denote these together under the term "content management". In this paper, we identify the fact checking tasks which can be performed with the help of content management technologies, and survey the recent research works in this area, before laying out some perspectives for the future. We hope our work will provide interested researchers, journalists and fact checkers with an entry point in the existing literature as well as help develop a roadmap for future research and development work.
@InProceedings{Cazalens2018, 
  title = {{A Content Management Perspective on Fact-Checking}},
  author = {Sylvie Cazalens and Philippe Lamarre and Julien Leblay and Ioana Manolescu and Xavier Tannier},
  booktitle = {Proceedings of the Web Conference 2018}, 
  address = {Lyon, France}, 
  year = {2018}, 
  month = apr
}
Julien Leblay, Ioana Manolescu, Xavier Tannier.
Computational fact-checking: problems, state of the art, and perspectives.
Lyon, France, April 2018.
Tutorial presented at the Web Conference 2018.
[abstract] [BibTeX] [See our more complete VLDB slides]
The tremendous value of Big Data has been noticed of late also by the media, and the term "data journalism'' has been coined to refer to journalistic work inspired by digital data sources. A particularly popular and active area of data journalism is concerned with fact-checking. The term was born in the journalist community and referred to the process of verifying and ensuring the accuracy of published media content; since 2012, however, it has increasingly focused on the analysis of politics, economy, science, and news content shared in any form, but first and foremost on the Web (social and otherwise). These trends have been noticed by computer scientists working in industry and academia. Thus, a very lively area of digital content management research has taken up these problems and works to propose foundations (models) and algorithms, and to implement them through concrete tools. To cite just one example, Google has recognized the usefulness and importance of fact-checking efforts by indexing fact-checks and showing them next to links returned to the users. Our tutorial:
  1. Outlines the current state of affairs in the area of digital (or computational) fact-checking in newsrooms, by journalists, NGO workers, scientists and IT companies;
  2. Shows which areas of digital content management research, in particular those relying on the Web, can be leveraged to help fact-checking, and gives a comprehensive survey of efforts in this area;
  3. Highlights ongoing trends, unsolved problems, and areas where we envision future scientific and practical advances.
@Misc{Leblay2018, 
  title = {{Computational fact-checking: problems, state of the art, and perspectives}},
  author = {Julien Leblay and Ioana Manolescu and Xavier Tannier},
  address = {Lyon, France}, 
  year = {2018}, 
  month = apr, 
  note = {Tutorial presented at the Web Conference 2018.}
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Searching for Truth in a Database of Statistics.
in Proceedings of the 21st International Workshop on the Web and Databases (WebDB 2018). Houston, USA, June 2018.
[abstract] [BibTeX]
The proliferation of falsehood and misinformation, in particular through the Web, has led to increasing energy being invested in journalistic fact-checking. Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies or by reputed international organizations are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. We present a novel algorithm enabling the exploitation of such statistic tables, by 1) identifying the statistic datasets most relevant for a given fact-checking query, and 2) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE, the French national statistic institute. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
@InProceedings{Cao2018, 
  title = {{Searching for Truth in a Database of Statistics}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {Proceedings of the 21st International Workshop on the Web and Databases (WebDB 2018)}, 
  address = {Houston, USA}, 
  year = {2018}, 
  month = jun
}
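Step (1) above, finding the datasets most relevant to a fact-checking query, can be approximated by plain TF-IDF retrieval; this sketch uses invented dataset titles and is only a stand-in for the paper's algorithm:

# Rank statistic datasets by cosine similarity between their descriptions
# and the query, as a crude proxy for dataset relevance ranking.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datasets = [
    "unemployment rate by region and quarter",
    "population by age group and department",
    "consumer price index monthly evolution",
]
query = "unemployment rate in 2017"

vec = TfidfVectorizer(stop_words="english")
D = vec.fit_transform(datasets)
scores = cosine_similarity(vec.transform([query]), D).ravel()
ranking = np.argsort(-scores)
print([datasets[i] for i in ranking])  # most relevant dataset first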
Judith Jeyafreeda Andrew, Xavier Tannier.
Automatic Extraction of Entities and Relation from Legal Documents.
in Proceedings of the ACL Named Entities Workshop (NEWS 2018). Melbourne, Australia, pages 1-8, July 2018.
[abstract] [BibTeX] [ACL Anthology]
In recent years, journalists and computer scientists have been talking to each other to identify technologies that can help extract useful information; this is called "computational journalism". In this paper, we present a method that enables journalists to automatically identify and annotate entities such as names of people, organizations, and roles and functions of people in legal documents; the relationships between these entities are also explored. The system uses a combination of statistical and rule-based techniques: the statistical method used is Conditional Random Fields, and for the rule-based technique, document- and language-specific regular expressions are used.
@InProceedings{Andrew2018, 
  title = {{Automatic Extraction of Entities and Relation from Legal Documents}},
  author = {Andrew, Judith Jeyafreeda and Tannier, Xavier},
  booktitle = {Proceedings of the ACL Named Entities Workshop (NEWS 2018)}, 
  address = {Melbourne, Australia}, 
  year = {2018}, 
  month = jul, 
  pages = {1-8}
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Extracting Linked Data from statistic spreadsheets.
in 34ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA 2018). Bucarest, Romania, October 2018.
[abstract] [BibTeX]
Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. In this work, we provide a conceptual model for the open data comprised in statistics published by INSEE, the French national economic and societal statistics institute. Then, we describe a novel method for extracting RDF Linked Open Data to populate an instance of this model. We used our method to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction. Further, we also present a novel algorithm enabling the exploitation of such statistic tables, by (i) identifying the statistic datasets most relevant for a given fact-checking query, and (ii) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
@InProceedings{Cao2018b, 
  title = {{Extracting Linked Data from statistic spreadsheets}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {34ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA 2018)}, 
  address = {Bucarest, Romania}, 
  year = {2018}, 
  month = oct
}
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers.
in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017, short paper). Vancouver, Canada, August 2017.
[abstract] [BibTeX] [ACL Anthology]
We present a neural architecture for containment relation identification between medical events and/or temporal expressions. We experiment on a corpus of de-identified clinical notes in English from the Mayo Clinic, namely the THYME corpus. Our model achieves an F-measure of 0.591 and outperforms the best results reported on this corpus to date.
@InProceedings{Tourille2017b, 
  title = {{Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017, short paper)}, 
  address = {Vancouver, Canada}, 
  year = {2017}, 
  month = aug
}
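A minimal PyTorch sketch in the spirit of the Bi-LSTM container-detection model above (hyperparameters, label set and input encoding are assumptions, not the paper's exact architecture):

# Encode the sentence with a Bi-LSTM, concatenate the hidden states at the
# two candidate positions (event and temporal expression), and classify the
# pair as CONTAINS / NO-RELATION.
import torch
import torch.nn as nn

class BiLSTMRelation(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * hidden, n_labels)  # two positions, 2*hidden each

    def forward(self, tokens, pos_event, pos_time):
        h, _ = self.lstm(self.emb(tokens))           # (batch, seq, 2*hidden)
        batch = torch.arange(tokens.size(0))
        pair = torch.cat([h[batch, pos_event], h[batch, pos_time]], dim=-1)
        return self.out(pair)                        # relation logits

model = BiLSTMRelation(vocab_size=100)
tokens = torch.randint(0, 100, (1, 12))              # one toy sentence
logits = model(tokens, torch.tensor([3]), torch.tensor([7]))
print(logits.shape)                                  # torch.Size([1, 2])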
Jose Moreno, Romaric Besançon, Romain Beaumont, Eva D'Hondt, Anne-Laure Ligozat, Sophie Rosset, Xavier Tannier, Brigitte Grau.
Combining Word and Entity Embeddings for Entity Linking.
in Proceedings of the 14th Extended Semantic Web Conference (ESWC 2017). Portorož, Slovenia, May 2017.
[abstract] [BibTeX] [SpringerLink]
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and, in a second step, determine which is the best one. This paper proposes a novel method for the second step which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context. The relative improvement of this approach is experimentally validated on a benchmark corpus from the TAC-EDL 2015 evaluation campaign.
@InProceedings{Moreno2017, 
  title = {{Combining Word and Entity Embeddings for Entity Linking}},
  author = {Jose Moreno and Romaric Besançon and Romain Beaumont and Eva D'Hondt and Anne-Laure Ligozat and Sophie Rosset and Xavier Tannier and Brigitte Grau},
  booktitle = {Proceedings of the 14th Extended Semantic Web Conference (ESWC 2017)}, 
  address = {Portorož, Slovenia}, 
  year = {2017}, 
  month = may
}
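The candidate-ranking step described above can be pictured as cosine scoring between a context vector and entity vectors living in the same space; the random vectors below merely stand in for the jointly learned word and entity embeddings:

# Score each candidate entity by cosine similarity between its embedding
# and the averaged embedding of the words around the mention.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
word_emb = {w: rng.normal(size=dim) for w in
            "the paris team won the league".split()}
entity_emb = {"Paris_(city)": rng.normal(size=dim),
              "Paris_Saint-Germain": rng.normal(size=dim)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

context = ["team", "won", "league"]               # words around the mention
ctx_vec = np.mean([word_emb[w] for w in context], axis=0)

scores = {e: cosine(v, ctx_vec) for e, v in entity_emb.items()}
print(max(scores, key=scores.get))                # best-scoring candidate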
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
Temporal information extraction from clinical text.
in Proceedings of the European Chapter of the ACL (EACL 2017, short paper). Valencia, Spain, April 2017.
[abstract] [BibTeX] [ACL Anthology] [poster]
In this paper, we present a method for temporal relation extraction from clinical narratives in French and in English. We experiment on two comparable corpora, the MERLOT corpus for French and the THYME corpus for English, and show that a common approach can be used for both languages.
@InProceedings{Tourille2017, 
  title = {{Temporal information extraction from clinical text}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the European Chapter of the ACL (EACL 2017, short paper)}, 
  address = {Valencia, Spain}, 
  year = {2017}, 
  month = apr
}
Swen Ribeiro, Olivier Ferret, Xavier Tannier.
Unsupervised Event Clustering and Aggregation from Newswire and Web Articles.
in Proceedings of the 2nd workshop "Natural Language meets Journalism" (EMNLP 2017). Copenhagen, Denmark, September 2017.
[abstract] [BibTeX] [ACL Anthology]
In this paper we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the Web in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.
@InProceedings{Ribeiro2017, 
  title = {{Unsupervised Event Clustering and Aggregation from Newswire and Web Articles}},
  author = {Ribeiro, Swen and Ferret, Olivier and Tannier, Xavier},
  booktitle = {Proceedings of the 2nd workshop "Natural Language meets Journalism" (EMNLP 2017)}, 
  address = {Copenhagen, Denmark}, 
  year = {2017}, 
  month = sep
}
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives.
in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017). Vancouver, Canada, August 2017.
Selected for the "Best of SemEval 2017"
[abstract] [BibTeX] [ACL anthology]
In this paper we present our participation in SemEval 2017 Task 12. We used a neural network based approach for entity and temporal relation extraction, and experimented with two domain adaptation strategies. We achieved competitive performance for both tasks.
@InProceedings{Tourille2017c, 
  title = {{LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017)}, 
  address = {Vancouver, Canada}, 
  year = {2017}, 
  month = aug
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Extracting Linked Data from statistic spreadsheets.
in Proceedings of the SIGMOD workshop on Semantic Big Data (SBD 2017). Chicago, USA, May 2017.
[abstract] [BibTeX] [ACMLink] [paper]
Statistic data is an important sub-category of open data; it is interesting for many applications, including but not limited to data journalism, as such data is typically of high quality, and reflects (under an aggregated form) important aspects of a society’s life such as births, immigration, economic output, etc. However, such open data is often not published as Linked Open Data (LOD), limiting its usability. We provide a conceptual model for the open data comprised in statistic files published by INSEE, the leading French economic and societal statistics institute. Then, we describe a novel method for extracting RDF LOD populating an instance of this conceptual model. Our method was used to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction.
@InProceedings{Cao2017, 
  title = {{Extracting Linked Data from statistic spreadsheets}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {Proceedings of the SIGMOD workshop on Semantic Big Data (SBD 2017)}, 
  address = {Chicago, USA}, 
  year = {2017}, 
  month = may
}
José Moreno, Romaric Besançon, Romain Beaumont, Eva D'Hondt, Anne-Laure Ligozat, Sophie Rosset, Xavier Tannier, Brigitte Grau.
Apprendre des représentations jointes de mots et d'entités pour la désambiguïsation d'entités.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2017). Orléans, France, June 2017.
[abstract] [BibTeX]
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. However, systems have to deal with ambiguity, as numerous entities could be linked to a mention. This paper proposes a novel method for entity disambiguation which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space, we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context.
@InProceedings{Moreno2017b, 
  title = {{Apprendre des représentations jointes de mots et d'entités pour la désambiguïsation d'entités}},
  author = {José Moreno and Romaric Besançon and Romain Beaumont and Eva D'Hondt and Anne-Laure Ligozat and Sophie Rosset and Xavier Tannier and Brigitte Grau},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2017)}, 
  address = {Orléans, France}, 
  year = {2017}, 
  month = jun
}
Xavier Tannier.
NLP-driven Data Journalism: Time-Aware Mining and Visualization of International Alliances.
in Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016). New York, USA, July 2016.
[abstract] [BibTeX] [poster] [paper]
We take inspiration from computational and data journalism, and propose to combine techniques from information extraction, information aggregation and visualization to build a tool identifying the evolution of alliance and opposition relations between countries on specific topics. These relations are aggregated into numerical data that are visualized by time-series plots or dynamic graphs.
@InProceedings{Tannier2016a, 
  title = {{NLP-driven Data Journalism: Time-Aware Mining and Visualization of International Alliances}},
  author = {Xavier Tannier},
  booktitle = {Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016)}, 
  address = {New York, USA}, 
  year = {2016}, 
  month = jul
}
Xavier Tannier, Frédéric Vernier.
Creation, Visualization and Edition of Timelines for Journalistic Use.
in Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016). New York, USA, July 2016.
[abstract] [BibTeX] [paper] [slides]
We describe in this article a system for building and visualizing thematic timelines automatically. The input of the system is a set of keywords, together with temporal user-specified boundaries. The output is a timeline graph showing at the same time the chronology and the importance of the events concerning the query. This requires natural language processing and information retrieval techniques, allied to a very specific temporal smoothing and visualization approach. The result can be edited so that the journalist always has the final say on what is finally displayed to the reader.
@InProceedings{Tannier2016b, 
  title = {{Creation, Visualization and Edition of Timelines for Journalistic Use}},
  author = {Xavier Tannier and Frédéric Vernier},
  booktitle = {Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016)}, 
  address = {New York, USA}, 
  year = {2016}, 
  month = jul
}
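The temporal smoothing mentioned above can be illustrated with a simple moving average over daily article counts; the paper's smoothing is more specific, so treat this as a schematic stand-in with invented counts:

# Smooth a timeline of daily article counts with a 3-day moving average
# before plotting, so that isolated spikes become readable event bumps.
import numpy as np

daily_counts = np.array([0, 1, 0, 8, 12, 3, 1, 0, 0, 5, 2, 0])
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(daily_counts, kernel, mode="same")
for day, (raw, s) in enumerate(zip(daily_counts, smoothed)):
    print(f"day {day:2d}: raw={raw:2d} smoothed={s:.1f}")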
Sergio Torres Aguilar, Xavier Tannier, Pierre Chastang.
Named entity recognition applied on a data base of Medieval Latin charters. The case of chartae burgundiae.
in Proceedings of the 3rd International Workshop on Computational History (HistoInformatics 2016). Krakow, Poland, July 2016.
[BibTeX] [paper]
@InProceedings{Torres2016, 
  title = {{Named entity recognition applied on a data base of Medieval Latin charters. The case of chartae burgundiae}},
  author = {Torres Aguilar, Sergio and Tannier, Xavier and Chastang, Pierre},
  booktitle = {Proceedings of the 3rd International Workshop on Computational History (HistoInformatics 2016)}, 
  address = {Krakow, Poland}, 
  year = {2016}, 
  month = jul
}
Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clecq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, Gülşen Eryiğit.
SemEval-2016 Task 5: Aspect Based Sentiment Analysis.
in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016). San Diego, USA, June 2016.
[abstract] [BibTeX] [Annotation Guidelines] [Paper]
This paper describes the SemEval 2016 shared task on Aspect Based Sentiment Analysis (ABSA), a continuation of the respective tasks of 2014 and 2015. In its third year, the task provided 19 training and 20 testing datasets for 8 languages and 7 domains, as well as a common evaluation procedure. From these datasets, 25 were for sentence-level and 14 for text-level ABSA; the latter was introduced for the first time as a subtask in SemEval. The task attracted 245 submissions from 29 teams.
@InProceedings{Pontiki2016, 
  title = {{SemEval-2016 Task 5: Aspect Based Sentiment Analysis}},
  author = {Pontiki, Maria and Galanis, Dimitrios and Papageorgiou, Haris and Androutsopoulos, Ion and Manandhar, Suresh and Al-Smadi, Mohammad and Al-Ayyoub, Mahmoud and Zhao, Yanyan and Qin, Bing and De Clecq, Orphée and Hoste, Véronique and Apidianaki, Marianna and Tannier, Xavier and Loukachevitch, Natalia and Kotelnikov, Evgeny and Bel, Nuria and Jiménez-Zafra, Salud María and Eryiğit, Gülşen},
  booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)}, 
  address = {San Diego, USA}, 
  year = {2016}, 
  month = jun
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers.
in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016). San Diego, USA, June 2016.
Selected for the "Best of SemEval 2016"
[abstract] [BibTeX] ["Best of SemEval" slides] [paper] [poster]
SemEval 2016 Task 12 addresses temporal reasoning in the clinical domain. In this paper, we present our participation, focused on relation extraction based on gold standard entities (subtasks DR and CR). We used a supervised approach comparing plain lexical features to word embeddings for temporal relation identification, and obtained above-median scores.
@InProceedings{Tourille2016b, 
  title = {{LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)}, 
  address = {San Diego, USA}, 
  year = {2016}, 
  month = jun
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
Extraction de relations temporelles dans des dossiers électroniques patient.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2016, article court). Paris, France, July 2016.
[abstract] [BibTeX] [poster] [free copy]
Temporal analysis of clinical documents yields complex representations of the information contained in Electronic Health Records. This type of analysis relies on the extraction of medical events, temporal expressions and the relations between them. In this work, we assume that relevant events and temporal expressions are available, and we focus on the extraction of relations between two events or between an event and a temporal expression. We present supervised classification models and apply them to clinical documents written in French and in English. The performance we achieve is high and similar in both languages. We believe these results suggest that temporal analysis may be approached generically across clinical domains and languages.
@InProceedings{Tourille2016a, 
  title = {{Extraction de relations temporelles dans des dossiers électroniques patient}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2016, article court)}, 
  address = {Paris, France}, 
  year = {2016}, 
  month = jul
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
A Dataset for Open Event Extraction in English.
in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia, May 2016.
[abstract] [BibTeX] [free copy] [poster]
This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates with no supervision from unlabeled texts, and of grouping together entities corresponding to the same role in a template. Most of the previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially-annotated corpus in English which remedies some of these shortcomings. We make use of Wikinews to select the data inside the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate some existing systems on this new data.
@InProceedings{Nguyen2016, 
  title = {{A Dataset for Open Event Extraction in English}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)}, 
  address = {Portorož, Slovenia}, 
  year = {2016}, 
  month = may
}
Marianna Apidianaki, Xavier Tannier, Cécile Richart.
Datasets for Aspect-Based Sentiment Analysis in French.
in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia, May 2016.
[abstract] [BibTeX] [poster] [free copy]
Aspect Based Sentiment Analysis (ABSA) is the task of mining and summarizing opinions from text about specific entities and their aspects. This article describes two datasets for the development and testing of ABSA systems for French which comprise user reviews annotated with relevant entities, aspects and polarity values. The first dataset contains 457 restaurant reviews (2365 sentences) for training and testing ABSA systems, while the second contains 162 museum reviews (655 sentences) dedicated to out-of-domain evaluation. Both datasets were built as part of SemEval-2016 Task 5 "Aspect-Based Sentiment Analysis" where seven different languages were represented, and are publicly available for research purposes. This article provides examples and statistics by annotation type, summarizes the annotation guidelines and discusses their cross-lingual applicability. It also explains how the data was used for evaluation in the SemEval ABSA task and briefly presents the results obtained for French.
@InProceedings{Apidianaki2016, 
  title = {{Datasets for Aspect-Based Sentiment Analysis in French}},
  author = {Marianna Apidianaki and Xavier Tannier and Cécile Richart},
  booktitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)}, 
  address = {Portorož, Slovenia}, 
  year = {2016}, 
  month = may
}
Aurélie Névéol, K. Bretonnel Cohen, Cyril Grouin, Thierry Hamon, Thomas Lavergne, Liadh Kelly, Lorraine Goeuriot, Grégoire Rey, Aude Robert, Xavier Tannier, Pierre Zweigenbaum.
Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016.
in CLEF 2016 (online working notes). Evora, Portugal, September 2016.
[abstract] [BibTeX] [free copy]
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System (UMLS), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates, using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entity recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.
@InProceedings{Neveol2016, 
  title = {{Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016}},
  author = {Névéol, Aurélie and Cohen, K. Bretonnel and Grouin, Cyril and Hamon, Thierry and Lavergne, Thomas and Kelly, Liadh and Goeuriot, Lorraine and Rey, Grégoire and Robert, Aude and Tannier, Xavier and Zweigenbaum, Pierre},
  booktitle = {CLEF 2016 (online working notes)}, 
  address = {Evora, Portugal}, 
  year = {2016}, 
  month = sep
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
Generative Event Schema Induction with Entity Disambiguation.
in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015). Beijing, China, July 2015.
[abstract] [BibTeX] [slides] [ACL anthology] [video]
This paper presents a generative model for event schema induction. Previous methods in the literature only use head words to represent entities. However, elements other than head words contain useful information. For instance, an armed man is more discriminative than man. Our model takes this information into account and precisely represents it using probabilistic topic distributions. We illustrate that such information plays an important role in parameter estimation. Most notably, it makes topic distributions more coherent and more discriminative. Experimental results on a benchmark dataset empirically confirm this enhancement.
@InProceedings{Nguyen2015, 
  title = {{Generative Event Schema Induction with Entity Disambiguation}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015)}, 
  address = {Beijing, China}, 
  year = {2015}, 
  month = jul
}
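The model above is a custom generative model; as a loose stand-in, the following sketch runs off-the-shelf LDA (gensim) over full entity mentions to show how modifiers such as "armed" can sharpen the induced distributions compared to bare head words. Data and parameters are toy assumptions, not the paper's.

# Stand-in illustration only: off-the-shelf LDA over entity mentions,
# where each pseudo-document is one mention (head word + modifiers).
from gensim import corpora, models

mentions = [["armed", "man"], ["masked", "gunman"], ["armed", "attacker"],
            ["police", "officer"], ["police", "patrol"], ["officer"]]

dictionary = corpora.Dictionary(mentions)
corpus = [dictionary.doc2bow(m) for m in mentions]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=20)
for topic_id in range(2):
    # Slot-like topics should separate perpetrator terms from police terms.
    print(lda.print_topic(topic_id))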
Mike Donald Tapi Nzali, Aurélie Névéol, Xavier Tannier.
Automatic Extraction of Time Expressions Across Domains in French Narratives.
in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015, short paper). Lisbon, Portugal, September 2015.
[abstract] [BibTeX] [ACL Anthology]
The prevalence of temporal references across all types of natural language utterances makes temporal analysis a key issue in Natural Language Processing. This work addresses three research questions: 1/ is temporal expression recognition specific to a particular domain? 2/ if so, can we characterize domain specificity? and 3/ how can subdomain specificity be integrated in a single tool for unified temporal expression extraction? Herein, we assess temporal expression recognition from documents written in French covering three domains. We present a new corpus of clinical narratives annotated for temporal expressions, and also use existing corpora in the newswire and historical domains. We show that temporal expressions can be extracted with high performance across domains (best F-measure 0.96, obtained with a CRF model on clinical narratives). We argue that domain adaptation for the extraction of temporal expressions can be done with limited effort and should cover pre-processing as well as temporal-specific tasks.
@InProceedings{TapiNzali2015b, 
  title = {{Automatic Extraction of Time Expressions Across Domains in French Narratives}},
  author = {Tapi Nzali, Mike Donald and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015, short paper)}, 
  address = {Lisbon, Portugal}, 
  year = {2015}, 
  month = sep
}
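The best system above is a CRF tagger. As a hedged sketch of that general recipe (BIO tagging of temporal expressions, with illustrative features that are not the authors'), using the sklearn-crfsuite library:

# Hedged sketch of CRF-based temporal-expression tagging with a BIO scheme.
# Features and the toy sentence are illustrative assumptions.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "is_title": w.istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy French sentence: "Patient hospitalisé le 12 mars 2014 ."
sents = [["Patient", "hospitalisé", "le", "12", "mars", "2014", "."]]
labels = [["O", "O", "O", "B-TIMEX3", "I-TIMEX3", "I-TIMEX3", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))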
Béatrice Arnulphy, Vincent Claveau, Xavier Tannier, Anne Vilnat.
Supervised Machine Learning Techniques to Detect TimeML Events in French and English.
in Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB 2015). Passau, Germany, June 2015.
[abstract] [BibTeX] [SpringerLink] [paper]
Identifying events from texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet no reference results are available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining for instance Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.
@InProceedings{Arnulphy2015, 
  title = {{Supervised Machine Learning Techniques to Detect TimeML Events in French and English}},
  author = {Béatrice Arnulphy and Vincent Claveau and Xavier Tannier and Anne Vilnat},
  booktitle = {Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB 2015)}, 
  address = {Passau, Germany}, 
  year = {2015}, 
  month = jun
}
Aurélie Névéol, Cyril Grouin, Xavier Tannier, Thierry Hamon, Liadh Kelly, Lorraine Goeuriot, Pierre Zweigenbaum.
CLEF eHealth Evaluation Lab 2015 Task 1b: clinical named entity recognition.
in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2015, CEUR-WS 1391). Toulouse, France, September 2015.
[abstract] [BibTeX] [CEUR-WS copy]
This paper reports on Task 1b of the 2015 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs by considering ten types of entities, including disorders, that were to be extracted from biomedical text in French. The task consisted of two phases: entity recognition (phase 1), in which participants could supply plain or normalized entities, and entity normalization (phase 2). The entities to be extracted were defined according to Semantic Groups in the Unified Medical Language System (UMLS), which was also used for normalizing the entities. Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE and 3 full-text drug monographs published by the European Medicines Agency (EMEA), using Precision, Recall and F-measure. In total, seven teams participated in phase 1, and three teams in phase 2. The highest performance was obtained on the EMEA corpus, with an overall F-measure of 0.756 for plain entity recognition, 0.711 for normalized entity recognition and 0.872 for entity normalization.
@InProceedings{Neveol2015, 
  title = {{CLEF eHealth Evaluation Lab 2015 Task 1b: clinical named entity recognition}},
  author = {Aurélie Névéol and Cyril Grouin and Xavier Tannier and Thierry Hamon and Liadh Kelly and Lorraine Goeuriot and Pierre Zweigenbaum},
  booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2015, CEUR-WS 1391)}, 
  address = {Toulouse, France}, 
  year = {2015}, 
  month = sep
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
Désambiguïsation d'entités pour l'induction non supervisée de schémas événementiels.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015). Caen, France, June 2015.
[abstract] [BibTeX] [free copy]
In this paper, we present an approach to event induction based on a generative model. This model makes it possible to consider more relational information than previous models, and has been applied to noun attributes. Through their influence on parameter estimation, these new features make the probabilistic topic distributions more discriminative and more robust. We evaluated different versions of our model on the MUC-4 dataset.

Cet article présente un modèle génératif pour l'induction non supervisée d'événements. Les précédentes méthodes de la littérature utilisent uniquement les têtes des syntagmes pour représenter les entités. Pourtant, le groupe complet (par exemple, "un homme armé") apporte une information plus discriminante (que "homme"). Notre modèle tient compte de cette information et la représente dans la distribution des schémas d'événements. Nous montrons que ces relations jouent un rôle important dans l'estimation des paramètres, et qu'elles conduisent à des distributions plus cohérentes et plus discriminantes. Les résultats expérimentaux sur le corpus de MUC-4 confirment ces progrès.
@InProceedings{Nguyen2015a, 
  title = {{Désambiguïsation d'entités pour l'induction non supervisée de schémas événementiels}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015)}, 
  address = {Caen, France}, 
  year = {2015}, 
  month = jun
}
Mike Donald Tapi Nzali, Aurélie Névéol, Xavier Tannier.
Analyse d'expressions temporelles dans les dossiers électroniques patients.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015). Caen, France, June 2015.
[abstract] [BibTeX] [Slides in pdf] [free copy]
References to phenomena occurring in the world and their temporal characterization can be found in a variety of natural language utterances. For this reason, temporal analysis is a key issue in natural language processing. This article presents a temporal analysis of specialized documents. We use a corpus of documents contained in several de-identified Electronic Health Records to develop an annotated resource of temporal expressions relying on the TimeML standard. We then use this corpus to evaluate several methods for the automatic extraction of temporal expressions. Our best statistical model yields 0.91 F-measure, a significant improvement in extraction over the state-of-the-art system HeidelTime. We also compare our medical corpus to FR-Timebank in order to characterize the uses of temporal expressions in two different subdomains.

Les références à des phénomènes du monde réel et à leur caractérisation temporelle se retrouvent dans beaucoup de types de discours en langue naturelle. Ainsi, l’analyse temporelle apparaît comme un élément important en traitement automatique de la langue. Cet article présente une analyse de textes en domaine de spécialité du point de vue temporel. En s'appuyant sur un corpus de documents issus de plusieurs dossiers électroniques patient désidentifiés, nous décrivons la construction d'une ressource annotée en expressions temporelles selon la norme TimeML. Par suite, nous utilisons cette ressource pour évaluer plusieurs méthodes d'extraction automatique d'expressions temporelles adaptées au domaine médical. Notre meilleur système statistique offre une performance de 0,91 de F-mesure, surpassant pour l'identification le système état de l'art HeidelTime. La comparaison de notre corpus de travail avec le corpus journalistique FR-Timebank permet également de caractériser les différences d'utilisation des expressions temporelles dans deux domaines de spécialité.
@InProceedings{TapiNzali2015, 
  title = {{Analyse d'expressions temporelles dans les dossiers électroniques patients}},
  author = {Mike Donald Tapi Nzali and Aurélie Névéol and Xavier Tannier},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015)}, 
  address = {Caen, France}, 
  year = {2015}, 
  month = jun
}
Kiem-Hieu Nguyen, Xavier Tannier, Véronique Moriceau.
Ranking Multidocument Event Descriptions for Building Thematic Timelines.
in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. Dublin, Ireland, August 2014.
[abstract] [BibTeX] [ACL Anthology]
This paper tackles the problem of timeline generation from traditional news sources. Our system builds thematic timelines for a general-domain topic defined by a user query. The system selects and ranks events relevant to the input query. Each event is represented by a one-sentence description in the output timeline. We present an inter-cluster ranking algorithm that takes events from multiple clusters as input and selects the most salient and relevant events. A cluster, in our work, contains all the events happening on a specific date. Our algorithm uses the temporal information derived from a large collection of extensively temporally analyzed texts. Such temporal information is combined with textual contents in an event scoring model in order to rank events based on their salience and query relevance.
@InProceedings{Nguyen2014a, 
  title = {{Ranking Multidocument Event Descriptions for Building Thematic Timelines}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Véronique Moriceau},
  booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics}, 
  address = {Dublin, Ireland}, 
  year = {2014}, 
  month = aug
}
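The abstract above describes an event scoring model combining salience and query relevance. A minimal sketch of that general idea, with an assumed linear combination and toy counts standing in for the paper's temporal statistics:

# Illustrative event scoring: combine query relevance with a date-based
# salience signal. The weighting scheme and features are assumptions,
# not the authors' exact model.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def score(event, query_vec, date_mentions, alpha=0.5):
    """alpha trades off query relevance against temporal salience."""
    relevance = cosine(Counter(event["tokens"]), query_vec)
    salience = math.log1p(date_mentions[event["date"]])
    return alpha * relevance + (1 - alpha) * salience

# Toy data: how often each date is mentioned across the collection.
date_mentions = Counter({"2011-03-11": 120, "2011-03-15": 40})
query = Counter("earthquake japan tsunami".split())
e = {"tokens": "earthquake hits japan coast".split(), "date": "2011-03-11"}
print(round(score(e, query, date_mentions), 3))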
Xavier Tannier.
Traitement des événements et ciblage d'information.
June 2014. Habilitation à Diriger des Recherches (HDR)
[abstract] [BibTeX] [Thesis] [Slides in pdf] [Slides in pptx]
Dans ce mémoire, nous organisons nos travaux principaux autour de quatre axes de traitement des informations textuelles : le ciblage, l'agrégation, la hiérarchisation et la contextualisation d'information. La majeure partie du document est dédiée à l'analyse des événements. Nous introduisons d'abord la notion d'événement à travers les diverses spécialités du traitement automatique des langues qui s'en sont préoccupées. Nous proposons ainsi un survol des différents modes de représentation des événements, tout en instaurant un fil rouge pour l'ensemble de la première partie. Nous distinguons ensuite deux grandes classes de travaux autour des événements, deux grandes visions que nous avons nommées, pour la première, l'"événement dans le texte", et pour la seconde, l'"événement dans le monde". Dans la première, nous considérons l'événement comme la désignation linguistique de quelque chose qui se passe, et nous tentons d'une part d'identifier ces désignations dans les textes, et d'autre part d'induire les relations temporelles existant entre ces événements, que ce soit dans des textes journalistiques ou médicaux. Nous réfléchissons enfin à une métrique d'évaluation adaptée à ce type d'informations. Pour ce qui est de l'"événement dans le monde", nous envisageons plus l'événement tel qu'il est perçu par le citoyen, et nous proposons plusieurs approches originales pour aider celui-ci à mieux appréhender la quantité écrasante d'événements dont il prend connaissance chaque jour : les chronologies thématiques, les fils temporels, et une approche automatisée du journalisme de données. La deuxième partie revient sur des travaux en lien avec le ciblage d'information. Nous décrivons tout d'abord nos travaux sur les systèmes de questions-réponses, dans lesquels nous avons eu recours à l'analyse syntaxique pour aider à justifier les réponses trouvées à une question en langage naturel. Enfin, nous abordons le sujet de la collecte thématique de documents sur le Web, dans le but de créer automatiquement des corpus et des lexiques spécialisés. Nous concluons en revenant sur les perspectives associées aux travaux présentés sur les événements, avec pour but d'abolir partiellement la frontière qui sépare les différents axes présentés.
@Misc{Tannier2014b, 
  title = {{Traitement des événements et ciblage d'information}},
  author = {Xavier Tannier},
  year = {2014}, 
  month = jun, 
  school = {Université Paris-Sud, École Doctorale d'Informatique}, 
  howpublished = {Habilitation à Diriger des Recherches (HDR)}
}
Clément De Groc, Xavier Tannier, Claude De Loupy.
Thematic Cohesion: Measuring Terms Discriminatory Power Toward Themes.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Slides]
We present a new measure of thematic cohesion. This measure associates each term with a weight representing its discriminatory power toward a theme, this theme being itself expressed by a list of terms (a thematic lexicon). This thematic cohesion criterion can be used in many applications, such as query expansion, computer-assisted translation, or iterative construction of domain-specific lexicons and corpora. The measure is computed in two steps. First, a set of documents related to the terms is gathered from the Web by querying a Web search engine. Then, we produce an oriented co-occurrence graph, where vertices are the terms and edges represent the fact that two terms co-occur in a document. This graph can be interpreted as a recommendation graph, where two terms occurring in the same document are considered to recommend each other. This leads to using a random walk algorithm that assigns a global importance value to each vertex of the graph. After observing the impact of various parameters on those importance values, we evaluate their correlation with retrieval effectiveness.
@InProceedings{DeGroc2014a, 
  title = {{Thematic Cohesion: Measuring Terms Discriminatory Power Toward Themes}},
  author = {Clément De Groc and Xavier Tannier and Claude De Loupy},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
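The second step of the measure above is a random walk over a term co-occurrence graph. A minimal sketch, assuming PageRank as the random-walk algorithm and toy documents standing in for the Web-retrieved ones:

# Random walk over a term co-occurrence graph, here via PageRank in
# networkx. Step one (Web retrieval) is stubbed out with toy documents.
import networkx as nx

docs = [["ski", "neige", "montagne"],
        ["ski", "piste", "neige"],
        ["montagne", "randonnée"]]
terms = {"ski", "neige", "montagne", "piste"}  # the thematic lexicon

G = nx.DiGraph()
for doc in docs:
    present = [t for t in doc if t in terms]
    for t1 in present:
        for t2 in present:
            if t1 != t2:
                # co-occurrence in a document as mutual recommendation
                w = G.get_edge_data(t1, t2, {"weight": 0})["weight"]
                G.add_edge(t1, t2, weight=w + 1)

# PageRank scores ~ discriminatory power of each term toward the theme
print(nx.pagerank(G, weight="weight"))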
Clément De Groc, Xavier Tannier.
Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Slides]
This article introduces a novel protocol and resource to evaluate Web-as-corpus topical document retrieval. Contrary to previous work, our goal is to provide an automatic, reproducible and robust evaluation for this task. We rely on the OpenDirectory (DMOZ) as a source of topically annotated webpages and index them in a search engine. With this OpenDirectory search engine, we can then easily evaluate the impact of various parameters, such as the number of seed terms, queries or documents, or the usefulness of various term selection algorithms. A first fully automatic evaluation is described and provides baseline performances for this task. The article concludes with practical information regarding the availability of the index and resource files.
@InProceedings{DeGroc2014b, 
  title = {{Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory}},
  author = {Clément De Groc and Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
Véronique Moriceau, Xavier Tannier.
French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Poster]
In this paper, we describe the development of French resources for the extraction and normalization of temporal expressions with HeidelTime, an open-source, multilingual, cross-domain temporal tagger. HeidelTime extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. Several types of temporal expressions are extracted: dates, times, durations and temporal sets. The French resources have been evaluated in two different ways: on the French TimeBank corpus, a corpus of newspaper articles in French annotated according to the ISO-TimeML standard, and on a user application for the automatic building of event timelines. Results on the French TimeBank are quite satisfying, as they are comparable to those obtained by HeidelTime in English and Spanish on newswire articles. Concerning the user application, we used two temporal taggers for the preprocessing of the corpus in order to compare their performance; results show that the performance of our application on French documents is better with HeidelTime. The French resources and evaluation scripts are publicly available with HeidelTime.
@InProceedings{Moriceau2014a, 
  title = {{French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime}},
  author = {Véronique Moriceau and Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
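HeidelTime itself is a Java tool driven by language-specific rule resources such as the ones the paper contributes. Purely to illustrate the TIMEX3-style normalization concept (not HeidelTime's actual rule format), here is a toy sketch for one French date pattern:

# Toy illustration of TIMEX3-style normalization for one French pattern
# ("12 mars 2014" -> 2014-03-12). Patterns are illustrative assumptions.
import re

MONTHS_FR = {"janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5,
             "juin": 6, "juillet": 7, "août": 8, "septembre": 9,
             "octobre": 10, "novembre": 11, "décembre": 12}
DATE_RE = re.compile(r"(\d{1,2})\s+(" + "|".join(MONTHS_FR) + r")\s+(\d{4})")

def normalize(text):
    """Wrap French full dates in TIMEX3-annotated spans."""
    def repl(m):
        day, month, year = int(m.group(1)), MONTHS_FR[m.group(2)], int(m.group(3))
        value = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<TIMEX3 type="DATE" value="{value}">{m.group(0)}</TIMEX3>'
    return DATE_RE.sub(repl, text)

print(normalize("Le patient a été opéré le 12 mars 2014."))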
Xavier Tannier.
Extracting News Web Page Creation Time with DCTFinder.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Poster]
Web pages do not offer reliable metadata concerning their creation date and time. However, getting the document creation time is a necessary step for applying temporal normalization systems to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Using such a system enables further deep and efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87 and 92%).
DCTFinder is made freely available on http://sourceforge.net/projects/dctfinder/, as well as all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself.
@InProceedings{Tannier2014a, 
  title = {{Extracting News Web Page Creation Time with DCTFinder}},
  author = {Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
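DCTFinder combines heuristics, CRFs and rules, and is available at the link above; the toy sketch below only illustrates the final rule-based step, recognizing one explicit English date pattern in page text. The pattern is an illustrative assumption, not the tool's rule set.

# Toy rule-based recognition of an explicit creation date in page text.
import re
from datetime import datetime

DATE_RE = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})")

def find_dct(page_text):
    """Return the first explicit date as ISO yyyy-mm-dd, or None."""
    m = DATE_RE.search(page_text)
    if not m:
        return None
    dt = datetime.strptime(" ".join(m.groups()), "%B %d %Y")
    return dt.date().isoformat()

print(find_dct("Published on May 12, 2014 by the newsroom."))  # 2014-05-12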
Béatrice Arnulphy, Vincent Claveau, Xavier Tannier, Anne Vilnat.
Techniques d’apprentissage supervisé pour l’extraction d’événements TimeML en anglais et français.
in Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014). Nancy, France, March 2014.
[abstract] [BibTeX] [free copy]
Identifying events from texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet no reference results are available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining for instance Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.

L’identification des événements au sein de textes est une tâche d’extraction d’informations importante et préalable à de nombreuses applications. Au travers des spécifications TimeML et des campagnes TempEval, cette tâche a reçu une attention particulière ces dernières années, mais aucun résultat de référence n’est disponible pour le français. Dans cet article nous tentons de répondre à ce problème en proposant plusieurs systèmes d’extraction, en faisant notamment collaborer champs aléatoires conditionnels, modèles de langues et k-plus-proches-voisins. Ces systèmes sont évalués sur le français et confrontés à l’état-de-l’art sur l’anglais. Les très bons résultats obtenus sur les deux langues valident notre approche.
@InProceedings{Arnulphy2014, 
  title = {{Techniques d’apprentissage supervisé pour l’extraction d’événements TimeML en anglais et français}},
  author = {Béatrice Arnulphy and Vincent Claveau and Xavier Tannier and Anne Vilnat},
  booktitle = {Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014)}, 
  address = {Nancy, France}, 
  year = {2014}, 
  month = mar
}
Clément de Groc, Xavier Tannier.
Apprendre à ordonner la frontière de crawl pour le crawling orienté.
in Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014). Nancy, France, March 2014.
[abstract] [BibTeX] [free copy]
Focused crawling consists in searching and retrieving a set of documents relevant to a specific domain of interest from the Web. Such crawlers prioritize their fetches by relying on a crawl frontier ordering strategy. In this article, we propose to learn this ordering strategy from annotated data using learning-to-rank algorithms. Such an approach allows us to cope with tunneling and to integrate a large number of heterogeneous features to guide the crawler. We describe a novel method to learn a domain-independent ranking function for topical Web crawling. We validate the relevance of our approach on "large" crawls of 40,000 documents on a set of 15 topics from the OpenDirectory, and show that our approach provides an increase in precision (harvest rate) of up to 10% compared to a baseline Shark Search algorithm. Finally, we discuss future directions regarding the application of learning-to-rank to focused Web crawling.

Le crawling orienté consiste à parcourir le Web au travers des hyperliens en orientant son parcours en direction des pages pertinentes. Pour cela, ces crawlers ordonnent leurs téléchargements suivant une stratégie d'ordonnancement. Dans cet article, nous proposons d'apprendre cette fonction d'ordonnancement à partir de données annotées. Une telle approche nous permet notamment d'intégrer un grand nombre de traits hétérogènes et de les combiner. Nous décrivons une méthode permettant d'apprendre une fonction d'ordonnancement indépendante du domaine pour la collecte thématique de documents. Nous évaluons notre approche sur de "longs" crawls de 40 000 documents sur 15 thèmes différents issus de l'OpenDirectory, et montrons que notre méthode permet d'améliorer la précision de près de 10 % par rapport à l'algorithme Shark Search. Enfin, nous discutons les avantages et inconvénients de notre approche, ainsi que les pistes de recherche ouvertes.
@InProceedings{deGroc2014, 
  title = {{Apprendre à ordonner la frontière de crawl pour le crawling orienté}},
  author = {Clément de Groc and Xavier Tannier},
  booktitle = {Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014)}, 
  address = {Nancy, France}, 
  year = {2014}, 
  month = mar
}
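As a hedged illustration of learning a frontier-ordering function, the sketch below uses a pairwise (RankSVM-style) reduction with scikit-learn; the URL features, data and linear model are assumptions, not the paper's exact setup.

# Pairwise learning-to-rank for crawl-frontier ordering: difference vectors
# between URL feature vectors are labeled by preference and fed to a linear
# classifier. Features and relevance grades are toy assumptions.
import numpy as np
from sklearn.svm import LinearSVC

# Each frontier URL: [anchor-text relevance, parent-page relevance, depth]
X = np.array([[0.9, 0.8, 1], [0.1, 0.2, 3], [0.7, 0.4, 2], [0.2, 0.1, 4]])
relevance = np.array([2, 0, 1, 0])  # graded annotation

pairs, prefs = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if relevance[i] != relevance[j]:
            pairs.append(X[i] - X[j])
            prefs.append(1 if relevance[i] > relevance[j] else -1)

ranker = LinearSVC().fit(np.array(pairs), np.array(prefs))
# Score frontier URLs: higher score = fetch earlier
print(X @ ranker.coef_.ravel())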
Xavier Tannier, Véronique Moriceau.
Building Event Threads out of Multiple News Articles.
in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Seattle, USA, October 2013.
[abstract] [BibTeX] [Poster] [ACL Anthology]
We present an approach for building multidocument event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event.
In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus.
We also share interesting comments concerning our manual annotation procedure for building a training and testing set.
@InProceedings{Tannier2013b, 
  title = {{Building Event Threads out of Multiple News Articles}},
  author = {Xavier Tannier and Véronique Moriceau},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)}, 
  address = {Seattle, USA}, 
  year = {2013}, 
  month = oct
}
Cyril Grouin, Natalia Grabar, Thierry Hamon, Sophie Rosset, Xavier Tannier, Pierre Zweigenbaum.
Eventual situations for timeline extraction from clinical reports.