19th Workshop on Building and Using Comparable Corpora

Please note that the program uses Mallorca time, i.e., GMT+2 (CEST).

Program: Monday, 11 May 2026

Room: Calvia, 1st floor

Zoom link: https://zoom.us/j/92983398859?pwd=gErMaK7bfqVgkSIaXq9Zb27bzItfnF.1

9:00

Session 1
Chair: Ayla Rigouts Terryn, Université de Montréal

9:00	Introduction

9:06	Keynote: The Cross-Lingual Transfer Myth: Why Modern LLMs Still Fail Without Comparable Corpora and Representations
Els Lefever (LT3, Ghent University)
10:06	A Comparative Study of Parkinsonian Speech Corpora for Deep Learning-Based Detection of Dysarthria
Clara Ponchard and Pierre Serrano (Inria)

10:30

Coffee break

11:00

Session 2: Comparable corpora for linguistics research
Chair: Philippe Langlais, Université de Montréal

11:00	Computing Semantic Similarity for Aligning Bilingual Semi-parallel Texts: A Case Study
Steffen Frenzel, Maximilian Krupop, Manfred Stede (University of Potsdam)
11:24	A Comparative Study in Corpus Linguistics Applied to Automatic Terminology Extraction
Mercè Vàzquez¹, Sergi Alvarez-Vidal², Antoni Oliver¹ (¹Universitat Oberta de Catalunya; ²Universitat Autònoma de Barcelona)
11:48	Comparable Corpora in Cross-linguistic Research: Nominal Number in English, Czech, and Greek
Konstantinos Diamantopoulos and Magda Ševčíková (Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics)
12:12	Liebe Kolleg:innen, Querid@s Compañer@s: Presenting the GILDEES Corpus
Marie-Pauline Krielke (Saarland University)
12:36	A Diachronic Comparable Corpus of Spanish Digital News (2017–2026) for the Study of Stylistic Convergence in the GenAI Era
Hugo Sanjurjo-González (University of Deusto)

13:00

Lunch break

14:00

Session 3: Synthetic corpora
Chair: Serge Sharoff, University of Leeds

14:00

Panel discussion: Comparable in the Age of LLMs: Fundamental questions at the intersection of comparable corpora and synthetic data (Chair: Serge Sharoff, University of Leeds)

Panelists:
Cristina España-Bonet (DFKI, Saarbrücken, Germany, and Barcelona Supercomputing Center, Barcelona, Spain)
Nizar Habash (NYU Abu Dhabi, UAE)
Philippe Langlais (Université de Montréal, Montréal, Canada)
Benoît Sagot (Inria, Paris, France)

15:12	Align and Shine: Building High-quality Sentence-aligned Corpora for Multilingual Text Simplification
Luis Kenji Hilasaca Sanchez, Nouran Khallaf, Serge Sharoff (University of Leeds)
15:36	Bi-Text Mining across German Dialects: On the Role of Synthetic Training Data for Dialect Adaptation
Jing Wang, Barbara Plank, Robert Litschko (LMU Munich)

16:00

Coffee break

16:30

Session 4: Building comparable datasets
Chair: Pierre Zweigenbaum, Université Paris-Saclay, CNRS

16:30	Parallel Corpora of Scholarly Documents for English-French Machine Translation
Ziqian Peng¹, Lichao Zhu², Rachel Bawden³, Maud Bénard², Éric de la Clergerie³, Mathilde Huguin⁴, Natalie Kübler², Paul Lerner⁵, Alexandra Mestivier², François Yvon⁵ (¹Sorbonne Université, CNRS, ISIR & Inria, Paris; ²Université Paris Cité, ALTAE; ³Inria; ⁴CNRS; ⁵Sorbonne Université, CNRS, ISIR)
16:54	Validating a Pipeline to Create a Comparable Corpus of Government-Issued Travel Advisories from the Internet Archives
Laura Braun and Christian Oswald (University of the German Federal Armed Forces)
17:18	Leveraging Comparable Toxicity Lexicons in Prompt Instructions for Multilingual Text Detoxification
Yassir El Attar, Esra Dönmez, Nina K. Ohlendorf, Agnieszka Falenska (IMS, University of Stuttgart)

17:42

Closing words

18:00

End of workshop

Last modified: 11 May 2026, 9:50