19th Workshop on Building and Using Comparable Corpora

Program: Monday, 11 May 2026

9:00    Session 1
Chair: Ayla Rigouts Terryn, Université de Montréal
9:00    Introduction
9:06    Keynote: The Cross-Lingual Transfer Myth: Why Modern LLMs Still Fail Without Comparable Corpora and Representations
Els Lefever
LT3, Ghent University
10:06    A Comparative Study of Parkinsonian Speech Corpora for Deep Learning-Based Detection of Dysarthria
Clara Ponchard and Pierre Serrano
Inria
10:30    Coffee break
11:00    Session 2: Comparable corpora for linguistics research
Chair: Philippe Langlais, Université de Montréal
11:00    Computing Semantic Similarity for Aligning Bilingual Semi-parallel Texts: A Case Study
Steffen Frenzel, Maximilian Krupop, Manfred Stede
University of Potsdam
11:24    A Comparative Study in Corpus Linguistics Applied to Automatic Terminology Extraction
Mercè Vàzquez¹, Sergi Alvarez-Vidal², Antoni Oliver¹
¹Universitat Oberta de Catalunya; ²Universitat Autònoma de Barcelona
11:48    Comparable Corpora in Cross-linguistic Research: Nominal Number in English, Czech, and Greek
Konstantinos Diamantopoulos and Magda Ševčíková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
12:12    Liebe Kolleg:innen, Querid@s Compañer@s: Presenting the GILDEES Corpus
Marie-Pauline Krielke
Saarland University
12:36    A Diachronic Comparable Corpus of Spanish Digital News (2017–2026) for the Study of Stylistic Convergence in the GenAI Era
Hugo Sanjurjo-González
University of Deusto
13:00    Lunch break
14:00    Session 3: Synthetic corpora
Chair: Serge Sharoff, University of Leeds
14:00    Panel discussion: How comparable are synthetic data?
15:12    Align and Shine: Building High-quality Sentence-aligned Corpora for Multilingual Text Simplification
Luis Kenji Hilasaca Sanchez, Nouran Khallaf, Serge Sharoff
University of Leeds
15:36    Bi-Text Mining across German Dialects: On the Role of Synthetic Training Data for Dialect Adaptation
Jing Wang¹, Barbara Plank², Robert Litschko²
¹Ludwig-Maximilians-Universität München; ²LMU Munich
16:00    Coffee break
16:30    Session 4: Building comparable datasets
Chair: Pierre Zweigenbaum, Université Paris-Saclay, CNRS
16:30    Parallel Corpora of Scholarly Documents for English-French Machine Translation
Ziqian Peng¹, Lichao Zhu², Rachel Bawden³, Maud Bénard², Éric de la Clergerie³, Mathilde Huguin⁴, Natalie Kübler², Paul Lerner⁵, Alexandra Mestivier², François Yvon⁵
¹Sorbonne Université, CNRS, ISIR & Inria, Paris; ²Université Paris Cité, ALTAE; ³Inria; ⁴CNRS; ⁵Sorbonne Université, CNRS, ISIR
16:54    Validating a Pipeline to Create a Comparable Corpus of Government-Issued Travel Advisories from the Internet Archives
Laura Braun and Christian Oswald
University of the German Federal Armed Forces
17:18    Leveraging Comparable Toxicity Lexicons in Prompt Instructions for Multilingual Text Detoxification
Yassir El Attar, Esra Dönmez, Nina K. Ohlendorf, Agnieszka Falenska
IMS, University of Stuttgart
17:42    Closing words
18:00    End of workshop
Last modified: 14 Apr 2026