19th Workshop on Building and Using Comparable Corpora

Please note that the program uses Mallorca time, i.e., GMT+2 (CEST).

Program: Monday, 11 May 2026

Room: Calvia, 1st floor

Zoom link: https://zoom.us/j/92983398859?pwd=gErMaK7bfqVgkSIaXq9Zb27bzItfnF.1

9:00    Session 1
Chair: Ayla Rigouts Terryn, Université de Montréal
9:00    Introduction
9:06    Keynote: The Cross-Lingual Transfer Myth: Why Modern LLMs Still Fail Without Comparable Corpora and Representations
Els Lefever
LT3, Ghent University
10:06    A Comparative Study of Parkinsonian Speech Corpora for Deep Learning-Based Detection of Dysarthria
Clara Ponchard and Pierre Serrano
Inria
10:30    Coffee break
11:00    Session 2: Comparable corpora for linguistics research
Chair: Philippe Langlais, Université de Montréal
11:00    Computing Semantic Similarity for Aligning Bilingual Semi-parallel Texts: A Case Study
Steffen Frenzel, Maximilian Krupop, Manfred Stede
University of Potsdam
11:24    A Comparative Study in Corpus Linguistics Applied to Automatic Terminology Extraction
Mercè Vàzquez (1), Sergi Alvarez-Vidal (2), Antoni Oliver (1)
(1) Universitat Oberta de Catalunya; (2) Universitat Autònoma de Barcelona
11:48    Comparable Corpora in Cross-linguistic Research: Nominal Number in English, Czech, and Greek
Konstantinos Diamantopoulos and Magda Ševčíková
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
12:12    Liebe Kolleg:innen, Querid@s Compañer@s: Presenting the GILDEES Corpus
Marie-Pauline Krielke
Saarland University
12:36    A Diachronic Comparable Corpus of Spanish Digital News (2017–2026) for the Study of Stylistic Convergence in the GenAI Era
Hugo Sanjurjo-González
University of Deusto
13:00    Lunch break
14:00    Session 3: Synthetic corpora
Chair: Serge Sharoff, University of Leeds
14:00    Panel discussion: Comparable in the Age of LLMs: Fundamental questions at the intersection of comparable corpora and synthetic data (Chair: Serge Sharoff, University of Leeds)

Panelists:
Cristina España-Bonet (DFKI, Saarbrücken, Germany, and Barcelona Supercomputing Center, Barcelona, Spain)
Nizar Habash (NYU Abu Dhabi, UAE)
Philippe Langlais (Université de Montréal, Montréal, Canada)
Benoît Sagot (Inria, Paris, France)
15:12    Align and Shine: Building High-quality Sentence-aligned Corpora for Multilingual Text Simplification
Luis Kenji Hilasaca Sanchez, Nouran Khallaf, Serge Sharoff
University of Leeds
15:36    Bi-Text Mining across German Dialects: On the Role of Synthetic Training Data for Dialect Adaptation
Jing Wang, Barbara Plank, Robert Litschko
LMU Munich
16:00    Coffee break
16:30    Session 4: Building comparable datasets
Chair: Pierre Zweigenbaum, Université Paris-Saclay, CNRS
16:30    Parallel Corpora of Scholarly Documents for English-French Machine Translation
Ziqian Peng (1), Lichao Zhu (2), Rachel Bawden (3), Maud Bénard (2), Éric de la Clergerie (3), Mathilde Huguin (4), Natalie Kübler (2), Paul Lerner (5), Alexandra Mestivier (2), François Yvon (5)
(1) Sorbonne Université, CNRS, ISIR & Inria, Paris; (2) Université Paris Cité, ALTAE; (3) Inria; (4) CNRS; (5) Sorbonne Université, CNRS, ISIR
16:54    Validating a Pipeline to Create a Comparable Corpus of Government-Issued Travel Advisories from the Internet Archives
Laura Braun and Christian Oswald
University of the German Federal Armed Forces
17:18    Leveraging Comparable Toxicity Lexicons in Prompt Instructions for Multilingual Text Detoxification
Yassir El Attar, Esra Dönmez, Nina K. Ohlendorf, Agnieszka Falenska
IMS, University of Stuttgart
17:42    Closing words
18:00    End of workshop
Last modified: 11 May 2026, 9:50