Slovak Language in the Era of Large Language Models (with the Support of the Leonardo Supercomputer)
You are warmly invited to a joint webinar on language modeling, organized by the National Competence Centres for HPC in Slovakia and Italy. The rise of large language models (LLMs), which require vast amounts of training data, initially put users of low-resource languages at a disadvantage.
As part of our project, we are working to overcome this barrier for the Slovak language through several strategies that may also offer methodological insights for other low-resource languages:
- Generating Bilingual Datasets: Using a carefully curated database of professionally edited Slovak books, we employ the LLaMA 3.3 70B Instruct model to translate texts into English and then back into Slovak. This process allows us to create two datasets—one for training a compact open-source model for English-to-Slovak translation, and another for improving the quality of machine-translated Slovak.
- Summarizing Scientific Texts: Using Gemini Flash Experimental and the PLOS scientific database, we generate summaries of scientific articles in Slovak. This dataset supports the training of Slovak LLMs in the area of specialized scientific terminology.
- Enhancing Cultural Context: Although models like DeepSeek and ChatGPT perform relatively well in Slovak, they struggle with culturally specific and contextual topics related to Slovakia. We plan to synthesize texts from Slovak sources to create a dataset that fills this gap.
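The round-trip translation step in the first item can be sketched in a few lines of Python. The `translate()` function below is a placeholder standing in for a prompt-based call to the LLaMA 3.3 70B Instruct model (the actual inference setup is not specified in this announcement); only the overall flow, from edited Slovak source to the two resulting datasets, reflects the described pipeline.

```python
# Sketch of the bilingual-dataset pipeline: Slovak -> English -> Slovak round trip.
# translate() is a stub; in practice it would prompt an LLM inference endpoint.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for an LLM translation call (hypothetical interface)."""
    # A real implementation might send a prompt such as:
    #   f"Translate the following {source} text into {target}:\n{text}"
    return f"[{target}] {text}"  # stub output for illustration only

def build_pairs(slovak_sentences):
    """Build the two datasets described above from edited Slovak text."""
    translation_pairs = []  # (English, Slovak): trains an EN->SK translator
    polishing_pairs = []    # (machine Slovak, edited Slovak): improves MT output
    for sk in slovak_sentences:
        en = translate(sk, "Slovak", "English")       # forward pass: SK -> EN
        sk_back = translate(en, "English", "Slovak")  # back-translation: EN -> SK
        translation_pairs.append((en, sk))      # target is the edited original
        polishing_pairs.append((sk_back, sk))   # map machine Slovak to edited Slovak
    return translation_pairs, polishing_pairs
```

In both datasets the professionally edited original serves as the reference target, which is what lets a compact model learn from the large model's translations.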
Date and Time: June 11, 2025, 10:00 – 11:00 CEST
Venue: online
Language: English
Speaker: Marek Dobeš
Co-authors: Radovan Garabík and Peter Bednár
Registration
Our aim is to mitigate data scarcity for the Slovak language and enhance the performance of LLMs in terms of linguistic accuracy, scientific discourse, and cultural relevance. We believe that the approaches explored in this case study may inspire similar efforts for other low-resource languages.
This research is conducted on high-performance infrastructure: the Slovak national supercomputer Devana and Leonardo, one of Europe's most powerful supercomputers, operated by Cineca in Italy. These platforms enable us to process multilingual datasets, train models at scale, and test advanced LLM techniques efficiently.
Although our case study focuses on Slovak, the methods and tools we are developing are broadly applicable to other underrepresented languages around the world. We warmly invite collaborators from all countries, not only from Central Europe or Italy, but from any region where a lack of language data poses a barrier to AI development. Our project demonstrates how European collaboration and shared use of supercomputing resources can open up new possibilities for inclusive, multilingual language modeling, especially for countries that have so far had limited opportunities to contribute to it.