Success story: When a production line knows what will happen in 10 minutes
Every disruption on a production line creates stress. Machines stop, people wait, production slows down, and decisions must be made under pressure. In the food industry—especially in the production of filled pasta products, where the process follows a strictly sequential set of technological steps—one unexpected issue at the end of the line can bring the entire production flow to a halt.
But what if the production line could warn in advance that a problem will occur in a few minutes? Or help decide, already during a shift, whether it still makes sense to plan packaging later the same day? These were exactly the questions that stood at the beginning of a research collaboration that brought together industrial data, artificial intelligence, and supercomputing power.
The research was carried out by an international team of experts in artificial intelligence and industrial analytics from both academia and the private sector. The project involved the company Prounion a.s. in cooperation with Constantine the Philosopher University in Nitra, as well as additional academic partners from the Czech Republic and Hungary.
Challenge
Modern production lines generate enormous volumes of data—from machine states and operating speeds to temperatures and production counts. Despite this, key operational decisions are still often made based on experience and intuition.
The researchers focused on a real production line for filled pasta products, where the product passes through a fixed sequence of machines—from raw material preparation, through forming and filling, to thermal processing and packaging. They identified two decisions with a critical impact on production efficiency:
Early warning: Is it possible to predict whether the packaging machine will stop within the next 10 minutes?
In-shift planning: Can it be reliably determined during the working day whether packaging will still take place later the same day?
Answering these questions required working with large volumes of time-series data while strictly respecting real production conditions—models were allowed to use only the information that is genuinely available at a given moment to an operator or shift supervisor.
Solution
The research team first unified data from all machines into a single time axis and processed it to accurately reflect the real operation of the production line. They then developed machine-learning models that worked exclusively with information available at the given moment—exactly as an operator or shift manager would have it in practice.
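To make the "only information available at that moment" constraint concrete, the sketch below shows one way such a dataset could be assembled in Python with pandas. The column names, the 1-minute sampling interval, and the 10-minute horizon are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

# Illustrative: machine signals resampled to a common 1-minute time axis.
# Column names such as "packer_running" or "line_speed" are hypothetical.
df = pd.read_csv("line_signals.csv", parse_dates=["timestamp"]).set_index("timestamp")
df = df.resample("1min").last().ffill()

HORIZON = 10  # minutes ahead

# Label: will the packaging machine be stopped at any point in the next 10 minutes?
stopped = (df["packer_running"] == 0).astype(int)
df["stop_within_10min"] = stopped.rolling(window=HORIZON).max().shift(-HORIZON)

# Features: only past and present information (rolling windows and lags).
df["speed_mean_15min"] = df["line_speed"].rolling("15min").mean()
df["stops_last_hour"] = stopped.rolling("60min").sum()
df["upstream_temp_lag5"] = df["oven_temperature"].shift(5)

dataset = df.dropna(subset=["stop_within_10min"])
X = dataset[["speed_mean_15min", "stops_last_hour", "upstream_temp_lag5"]]
y = dataset["stop_within_10min"].astype(int)
# X and y can now be split chronologically and fed to any classifier.
```

The key point the sketch illustrates is that the label looks strictly into the future, while every feature is computed only from data already recorded at prediction time.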
A key milestone of the project was access to high-performance computing resources. NSCC Slovakia facilitated access for the research team to the European EuroHPC supercomputing infrastructure, specifically to the Karolina supercomputer in the Czech Republic. This made it possible to rapidly experiment with different models, test them on real production days, and validate their behavior under conditions close to real industrial practice.
The supercomputer thus became not just a technical tool, but a key driver of innovation, enabling the transition from theoretical analytics to decisions that can be used in real operations.
Results
The model focused on early warning of packaging machine stoppages achieved very high accuracy. It was able to reliably identify situations in which a stoppage was likely within the next 10 minutes, while keeping the number of false alarms to a minimum. This means the alerts are trustworthy and do not overwhelm operators with unnecessary warnings.
The second model, designed for in-shift planning, was able with high reliability to determine whether packaging would still take place later the same day. Managers thus gained a practical basis for decisions related to staffing, work planning, and efficient use of time.
Both approaches share a common principle: they do not predict abstract numbers, but instead answer concrete questions that production teams face every day.
Impact and future potential
This success story shows that artificial intelligence in industry does not have to be a futuristic experiment. When analytics is focused on real operational decisions and supported by the right infrastructure, it can become a quiet and reliable assistant to production.
The solution is easily extendable to other production lines and sectors. Looking ahead, additional data—such as product types, planned maintenance, or shift schedules—can be integrated, allowing models to be even more precisely tailored to the specific needs of companies.
The key message is clear: When data, artificial intelligence, and supercomputers are aligned with real industrial needs, the result is solutions with immediate practical value.
Fear of breast cancer is a silent companion for many women. All it takes is an invitation to a preventive screening, a single phone call from a doctor, or the wait for test results—and the mind fills with questions: “Am I okay?” “What if I’m not?” “Could something be missed?” Even when screening confirms a negative result, the worries often persist.
That is precisely why it makes sense to seek new ways to detect cancer as early as possible—not to replace doctors, but to help them see more, faster, and with greater confidence. And this is where artificial intelligence enters the story. Not as a sci-fi technology, but as a tool that may one day help protect lives.
A Slovak research team from the University of Žilina has brought together medicine, artificial intelligence, and European supercomputers in a joint project with a clear goal: to improve the accuracy of breast cancer detection and support doctors in the interpretation of mammographic images.
Challenge
Mammography generates enormous volumes of imaging data. A single project may work with hundreds of thousands of images at extremely high resolution. The Slovak team from the University of Žilina worked with more than 434,000 mammograms, representing data on the scale of several terabytes.
At the same time, the team decided to use a foundation model—a massive neural network with nearly a billion parameters, originally developed for general image analysis. Such a model has enormous potential, but it also places extreme demands on computing power, memory, and data processing speed.
It quickly became clear that standard research infrastructure was simply not sufficient for such a volume of computations. Without a supercomputer, the project could not have continued.
Solution
The breakthrough came when the project gained access to the AI Factory VEGA in Slovenia, which is part of the European EuroHPC initiative. For the first time, Slovak medical AI research was able to work on infrastructure with a level of performance it had never had access to before.
On this platform, state-of-the-art NVIDIA H100 graphics accelerators, designed specifically for artificial intelligence, were available. The researchers built a complete technological pipeline there, from processing mammographic images to training the model itself.
First, the data had to be cleaned, optimized, and prepared so it could be loaded efficiently during computation. Then the process of adapting the large AI model began, as it “learned” to understand the subtle details of mammography. This was not a one-off computation—it was an incremental process in which the model improved step by step.
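The article does not name the specific foundation model or training framework, but the incremental adaptation it describes typically takes the form of a loop like the PyTorch-style sketch below. Everything here (the stand-in model, image sizes, and data) is a placeholder; gradient accumulation is shown as one common way to fit large models and batches into GPU memory.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real pipeline: the actual foundation model, image sizes and
# data loaders are not described in the article.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(224 * 224, 2))
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 1, 224, 224), torch.randint(0, 2, (64,))),
    batch_size=4,
)

ACCUM_STEPS = 8  # accumulate gradients to emulate a larger effective batch size
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    model.train()
    optimizer.zero_grad()
    for step, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        loss = loss_fn(model(images), labels) / ACCUM_STEPS
        loss.backward()                      # gradients accumulate across mini-batches
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()                 # weight update only every ACCUM_STEPS batches
            optimizer.zero_grad()
    # in practice: evaluate on a held-out set and checkpoint after each epoch
```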
The supercomputer thus became not only a powerful tool but a key partner in research. It made it possible to do what was previously virtually impossible: to train a massive medical AI model at once using an enormous volume of data.
Results
Researchers have shown that artificial intelligence can learn from mammographic images in a way that gradually enables it to distinguish between healthy tissue and changes that may signal a problem. In other words, the system began to learn how to “look” at images in a manner similar to a physician—searching for subtle details and small deviations that can be very difficult for the human eye to notice.
This progress is particularly important because it represents the first step toward enabling artificial intelligence to flag changes that a human might not notice at first glance. It is not about replacing the physician, but about providing a supporting tool that can help clinicians make decisions with greater confidence, especially in borderline and ambiguous cases.
Impact and future potential
If this research continues to be further developed, artificial intelligence could become a silent assistant in preventive screening. It can speed up the evaluation of imaging data, reduce the risk of overlooking subtle changes, and help detect disease at a stage when it is still highly treatable.
For women, this means in practice a greater chance of detecting cancer early and therefore greater hope of a full recovery. In the case of negative findings, women can receive an independent and objective second opinion, reducing the uncertainty that often follows screening. Although further work still lies ahead of the researchers, it is already clear that the direction this research is taking makes great sense. The goal is simple but powerful: to use modern technologies to help protect women's health and lives.
High-Performance Computing (HPC) offers researchers the ability to process enormous volumes of data and uncover connections that would otherwise remain hidden. Today, it is no longer just a tool for technical disciplines – it is increasingly valuable in social and environmental research as well. A great example is a project that harnessed the power of HPC to gain deeper insight into the relationship between humans, soil, and the landscape.
Challenge
Soil represents one of the most valuable resources we have — not only as a space for cultivation and economic activity, but also as a foundation of cultural identity, social relations, and quality of life. The way we use land is changing faster than ever before. The pressures of climate change, infrastructure development, housing demands, and renewable energy expansion are creating new tensions between economic interests, landscape protection, and the public good.
The foundation of fair and sustainable decision-making is participation — involving people in the processes that shape the land and environment they live in. However, if such processes are not well designed, they can lead to distrust, conflicts, and short-sighted solutions.
The research team from the Slovak University of Agriculture in Nitra therefore sought a way to capture, analyse, and connect these diverse perspectives. Their goal was to understand soil as a form of social and cultural capital — a space that brings together economic, environmental, and human values. To achieve this, they needed to process extensive datasets reflecting public discussions, attitudes, and values related to land and soil across the European context.
Solution
To better understand how different stakeholders perceive soil and its value, the team combined data analytics with participatory approaches. During the testing phase, they processed extensive textual data, expert documents, media outputs, and public statements that reflect societal attitudes toward soil and the landscape.
The team applied text mining methods to process the data, enabling the identification of recurring themes, linguistic patterns, and emotional attitudes related to land use. This approach opens the door to new insights, allowing researchers to derive from data how opinions are formed, where tensions arise, and what values people associate with the landscapes they inhabit.
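The article does not specify the exact toolchain (the computations themselves ran in an R environment, as noted below), but the core idea of surfacing recurring themes from text can be illustrated with a short Python sketch using scikit-learn topic modelling. The documents, the number of topics, and all parameters here are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus standing in for expert documents, media outputs, and public statements.
documents = [
    "Farmland should stay protected from new logistics parks.",
    "Solar plants on low-quality soil can coexist with farming.",
    "Local people were not consulted about the new land-use plan.",
]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term = vectorizer.fit_transform(documents)

# Fit a small topic model; the number of topics is a tunable assumption.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top_terms)}")
```

On a real corpus of this size, the same workflow is simply parallelised across many documents, which is where the HPC resources described in the next section come in.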
The goal of the research is not merely to collect information, but to transform it into actionable insights that help build consensus among the public, experts, and policymakers.
Use of HPC Infrastructure
The analysis of such extensive textual data required computational power beyond the capabilities of standard workstations. Therefore, the research team used the computing infrastructure provided by NSCC Slovakia to carry out the data processing.
In the testing phase, the computations were performed on a supercomputer using 128 core-hours in an R environment, enabling parallel processing of large datasets within a short time. This approach significantly reduced the analysis time while allowing the application of complex methodological frameworks typical for social and environmental data — such as modelling relationships between actors, tracking the occurrence of key concepts, and visualizing linguistic patterns.
Thanks to HPC computing, it was possible to:
process extensive text files from various sources without capacity limitations,
generate clear and structured data outputs that would take several times longer to produce on standard computers,
test the potential of the supercomputer for social science and interdisciplinary research that connects human behaviour, data, and spatial relationships.
Results
The test computations confirmed that the use of high-performance computing infrastructure enables efficient processing and analysis of extensive textual data originating from various social, environmental, and cultural sources. By applying text mining methods, the team was able to gain insights into key themes and the relationships between different stakeholders involved in land-use decision-making.
The analysis revealed significant differences in how various groups perceive soil and the landscape — whether in terms of economic, ecological, or value-based priorities. These insights help identify areas where misunderstandings and conflicts arise, while also highlighting shared values that can serve as a foundation for constructive dialogue.
The research confirmed that the use of HPC infrastructure significantly improves data processing efficiency and enables complex analyses to be carried out in a timeframe that would be unfeasible with standard computing resources. This established a reliable foundation for the main phase of the project, in which the results of the testing stage will be expanded with new data sources and methodological approaches.
The obtained results represent the first step toward developing a tool capable of linking quantitative data with social contexts — thereby contributing to a deeper understanding of the relationship between people, the landscape, and decisions regarding its use.
Impact and future:
The project confirmed that a high-performance computing environment provides significant benefits for social science and environmental research dealing with complex, unstructured data. The combination of social research and computational analytics has created a new approach that can be used to gain a deeper understanding of the relationship between humans, the landscape, and societal change.
From a methodological perspective, the project serves as a model example of how HPC can support interdisciplinary research that integrates environmental data, text corpora, legislation, and public discourse. Such an approach holds great potential within European initiatives focused on sustainable land management and landscape planning.
The results thus create a transferable framework that can be applied in both European and national projects — ranging from public policy research and participatory planning to the assessment of the social impacts of environmental decisions.
Data today can tell stories that we could not have captured just a few years ago. The research team harnessed the computational power of a supercomputer to analyse vast textual datasets in order to better understand how society perceives soil, landscape, and their value. The project demonstrates that the future of soil is hidden in data — and that high-performance computing can support not only scientists but also communities striving to find balance between development and sustainability.
Supercomputer for Everyone: Dare to Discover the World of Modern Computing
Once, supercomputers were a mysterious technology accessible only to top scientists working in futuristic laboratories. Today, however, a completely new story is being written. Supercomputers are now available to ordinary users — from universities, small companies, and even public administration — anyone who needs to handle computations far beyond the capabilities of a regular computer.
Researchers have prepared a simple user guide that explains, step by step, how to access available computing power. They did it themselves with the aim of helping anyone who wants to process large datasets, train artificial intelligence, model natural phenomena, or create new technological solutions. Just register, obtain a project, and you can explore, invent, and tackle your boldest ideas.
There’s no reason to be afraid.
You can think of a supercomputer as an extremely powerful machine with thousands of “brains” working together. It’s not sitting in your office or glowing under your desk — it’s housed in a specialized data center, and you control it conveniently through a web browser.
You simply prepare your task and submit it to the system. While the supercomputer gets to work, you can relax and enjoy a cup of coffee. Within minutes or hours, you’ll receive results that would take your laptop weeks to compute — or that it might not be able to handle at all.
Who can it help?
- students processing large amounts of data
- scientists testing new artificial intelligence algorithms
- meteorologists working on weather forecasting
- designers and engineers running simulations and developing new solutions
- doctors and biologists analyzing genomes or medical data
- small innovative companies without their own computing infrastructure
And many other fields are waiting for someone brave enough to explore them.
Why is it important?
We need a new impulse for innovation. We have smart people, bold ideas, and now also a tool that saves time and money and opens the path to world-class results. The supercomputer is here to accelerate scientific progress and drive economic growth.
The first webinar coming soon
The authors of the guide are preparing a practical webinar designed for complete beginners. We’ll show that access to supercomputing is truly within everyone’s reach — for anyone unafraid to explore new possibilities. The goal is to spark curiosity and break down the barriers between technology and its users.
Slovak scientists join forces in the fight against staphylococcal infection
Bacteria are among the smallest yet most dangerous adversaries in medicine. While some are harmless, others can cause serious infections where early diagnosis is crucial for successful treatment. A team of Slovak scientists from the Slovak Academy of Sciences is therefore exploring how to detect the presence of bacteria directly in tissue—quickly, accurately, and without the need for invasive procedures. Their research combines confocal Raman microscopy, photodynamic therapy, and data analysis using a supercomputer.
Challenge: Recognizing whether tissue is infected with bacteria is not always straightforward. In the early stages of infection, the differences between healthy and damaged cells often cannot be detected even under a microscope. Although traditional biochemical tests can confirm the presence of bacteria, they are usually time-consuming and require sample collection.
Solution: To identify subtle differences between healthy and infected tissue, the researchers decided to combine experimental measurements with advanced data processing. Raman spectra obtained from different depths and regions of the tissue contained an enormous amount of information that could not be reliably evaluated using conventional visual methods.
The scientists therefore sought to verify whether this method could reliably distinguish healthy tissue from tissue infected with Staphylococcus aureus, one of the most common causes of skin and mucous membrane inflammations. At the same time, the researchers focused on monitoring the effectiveness of photodynamic therapy—an experimental treatment based on carbon quantum dots that, when exposed to blue visible light, destroy bacteria without harming healthy cells.
Use of HPC Infrastructure
The team employed a mathematical analysis based on the Euclidean cosine of the squares of the first differentiated values, which enables the comparison of similarities between spectra after their transformation. This method eliminates background interference, highlights chemical changes in the tissue structure, and allows precise identification of differences caused by the presence of bacteria or the effects of treatment.
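The article's description of the metric is brief, so the snippet below is only one plausible reading of it: spectra are first-differentiated, squared, and then compared with a cosine similarity, which suppresses the constant background and emphasises changes in peak shape. The array shapes and values are illustrative.

```python
import numpy as np

def spectral_similarity(spectrum_a: np.ndarray, spectrum_b: np.ndarray) -> float:
    """Cosine similarity of the squared first differences of two Raman spectra.

    Differencing removes the slowly varying background, squaring emphasises
    strong spectral changes; the cosine then scores shape similarity in [0, 1].
    """
    a = np.diff(spectrum_a) ** 2
    b = np.diff(spectrum_b) ** 2
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative use: compare every measured spectrum against a healthy reference
# and collect the scores into a matrix over tissue depths and positions.
rng = np.random.default_rng(0)
reference = rng.random(1000)             # stand-in for a healthy-tissue spectrum
measurements = rng.random((5, 1000))     # stand-in for spectra from 5 tissue depths
scores = np.array([spectral_similarity(reference, m) for m in measurements])
print(scores)
```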
The computational power of a supercomputer was used to process the extensive datasets. Thanks to parallel data processing, it was possible to rapidly analyze hundreds of measurements from different tissue layers and visualize their similarities in a clear results matrix. Such an approach would be virtually impossible through manual evaluation.
The solution was developed through close collaboration among experts from multiple disciplines—biology, physics, materials research, and computational science. Reconstructed skin tissues were provided by the SK-NETVAL laboratories at the Institute of Experimental Pharmacology and Toxicology of the Centre of Experimental Medicine, Slovak Academy of Sciences (SAS), which also performed the exposure to the tested substances. The photodynamic treatment was applied by the team from the Polymer Institute of SAS, and the Raman data were collected at the Institute of Physics of SAS in cooperation with the Centre for Advanced Material Application.
Results
The analysis of the spectral data revealed significant chemical differences between healthy and infected tissue that can be detected using Raman microscopy. Samples infected with Staphylococcus aureus showed distinct spectral characteristics at all analyzed depths.
The results were particularly interesting in the samples that underwent photodynamic treatment. After the application of carbon quantum dots and subsequent activation with blue light, the chemical spectra closely approached those of healthy tissue. This suggests that the treatment effectively suppresses bacterial infection without damaging the cells themselves.
The applied algorithm proved to be a reliable and fast tool for comparing spectral data. Thanks to its implementation in the HPC environment, it was possible to automatically process large volumes of measurements and evaluate the results objectively, without any subjective interference from the researcher.
Impact and Future Potential
The project has brought new insights into the potential use of light and data analysis in medical diagnostics. It has demonstrated that combining Raman microscopy with computational methods enables not only the identification of bacterial infection in tissue but also the monitoring of treatment effectiveness in real time.
In the future, this approach could be applied in the development of new antibacterial therapies or in preclinical drug testing, where it is essential to quickly and accurately assess structural changes in tissue without invasive procedures. The research team also plans to extend the methodology to other types of bacteria and tissues and to leverage the computing power of the supercomputer for testing advanced artificial intelligence algorithms that could further automate the analysis.
The project is proof that the integration of biomedicine, physics, materials research, and computational science opens new possibilities for disease diagnosis and treatment. Slovak research teams are thus not only demonstrating their scientific excellence but also contributing to pushing the boundaries of modern medicine.
Supercomputer Accelerated the Development of Eco-Friendly Hydrogen Production
Hydrogen is one of the key elements driving the transition toward sustainable energy. Its carbon-free production represents a cornerstone of the green energy future — from industry to transportation. However, finding an efficient and affordable way to produce it remains a scientific challenge that brings together chemistry, materials research, and computational modeling.
In this success story, we take a look at how Slovak researchers harnessed the computational power of the NSCC Slovakia supercomputer to accelerate the development of a cheaper and more eco-friendly catalyst for hydrogen production. By combining laboratory experiments with HPC simulations, they succeeded in understanding the behavior of atoms on the surface of a material that could one day replace expensive metals such as platinum.
This research demonstrates how High-Performance Computing (HPC) helps push the boundaries of scientific knowledge and supports the transition toward cleaner and more sustainable energy — right from Slovak laboratories.
Challenge: Hydrogen is increasingly seen as the “fuel of the future” — carbon-free, clean, and applicable across industry, energy, and transportation. However, to make it truly accessible, it must be produced more cheaply and efficiently. Traditionally, expensive metals such as platinum have been used for this purpose, but they are not suitable for large-scale deployment. Scientists are therefore searching for new materials capable of catalyzing (accelerating) the reaction through which hydrogen is released from water.
Solution: A team of researchers from the Institute of Chemistry at Pavol Jozef Šafárik University in Košice and the Institute of Materials Research of the Slovak Academy of Sciences focused on molybdenum phosphide (MoP) — an inexpensive and readily available material with the potential to replace costly metals. They studied how MoP performs in different environments — from acidic to alkaline — and why it is able to maintain its efficiency.
The laboratory alone was not enough. The reactions on the catalyst’s surface are extremely fast and occur at the atomic level. To truly understand them, it was necessary to combine experimental research with computational simulations on a supercomputer.
Use of HPC Infrastructure
Collaboration with NSCC Slovakia and the use of a supercomputer enabled the researchers to create computer models of the catalyst and simulate what happens when hydrogen atoms bind to its surface.
Thanks to HPC, the team was able to:
uncover the reaction mechanism — how hydrogen atoms behave on the surface of MoP
verify the stability of the material in different environments
predict potential improvements to the catalyst even before it is produced in the laboratory.
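For context, a quantity such simulations commonly target in hydrogen evolution research is the Gibbs free energy of hydrogen adsorption, ΔG_H* = ΔE_H + ΔE_ZPE − TΔS, often approximated as the computed adsorption energy plus a correction of roughly 0.24 eV; a good catalyst has ΔG_H* close to zero. The snippet below only illustrates this standard descriptor with made-up adsorption energies and is not taken from the project's results.

```python
# Illustrative screening of candidate surface sites with the common hydrogen
# adsorption descriptor: dG_H* ≈ dE_H + 0.24 eV (zero-point + entropy correction).
CORRECTION_EV = 0.24

# Hypothetical DFT adsorption energies (eV) for different MoP surface sites;
# these numbers are placeholders, not results from the study.
adsorption_energies = {"Mo-top": -0.45, "P-top": -0.10, "bridge": 0.05}

for site, dE_H in adsorption_energies.items():
    dG_H = dE_H + CORRECTION_EV
    verdict = "promising" if abs(dG_H) < 0.2 else "less favourable"
    print(f"{site:7s}  dG_H* = {dG_H:+.2f} eV  ->  {verdict}")
```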
Impact
The outcome is significant for several reasons:
MoP is cheaper and more accessible than platinum, which could reduce the cost of hydrogen production.
The material works across a wide range of environments, meaning it could be deployed in various types of electrolyzers worldwide.
The combination of experiments and HPC simulations saves both time and costs — allowing researchers to identify the best solutions much faster.
This research shows that HPC is not just for physicists or computer scientists — it can also play a crucial role in advancing green energy. Thanks to the computational power of the supercomputer, Slovak researchers contributed to global knowledge on eco-friendly hydrogen production and paved the way for new technologies that can directly impact energy independence and sustainability.
The computing power of HPC brings new opportunities in the protection of the brown bear
High-Performance Computing (HPC) is a key technology of the modern era that fundamentally transforms the way complex problems are solved. Supercomputers can process enormous volumes of data and perform billions of calculations per second – tasks that would take ordinary computers months can be completed within hours. As a result, HPC accelerates scientific discoveries, enables simulations ranging from molecular interactions to climate change, and opens the door to the practical use of artificial intelligence. It is a driving force of innovation and competitiveness in medicine, industry, energy, and environmental protection.
It is not just an abstract concept – its benefits can be seen in concrete applications. Thanks to HPC, Slovak researchers trained complex artificial intelligence models on thousands of camera trap images to recognize the brown bear. A process that would normally take weeks was completed by the supercomputer in just a few hours. The result is a success story: the combination of modern technologies with nature conservation, improved human safety, and more efficient scientific work.
Challenge:
The brown bear (Ursus arctos) is one of the most iconic yet also controversial species in our nature. In Slovakia, its population is relatively stable, but monitoring its movement and behavior is crucial both for nature conservation and for human safety. Traditional methods such as visual observation or tracking are time-consuming and often inaccurate. Modern camera traps can capture thousands of images from the forest, but evaluating them manually is practically impossible.
Solution: A team of researchers from the Faculty of Natural Sciences and Informatics at UKF in Nitra developed an artificial intelligence system aimed at automatically recognizing whether a bear is present in an image or not. They used convolutional neural networks (CNNs) – the same principle applied, for example, in facial recognition on mobile phones.
For training the model, they collected:
4 974 images with a bear
656 images without a bear (other animals or empty forest)
The data were provided by the Slovak Hunting Chamber, the National Zoological Garden in Bojnice, and the State Nature Conservancy of the Slovak Republic.
Use of HPC infrastructure: Training artificial intelligence on such data is extremely computationally demanding. It requires repeated processing of thousands of high-resolution images (512×512 px), parameter tuning, and testing of different model architectures.
A regular computer would need weeks for this process. Thanks to the supercomputer and NSCC Slovakia, it was possible to:
analyze the weak points of the model and visualize what it had “learned”
HPC enabled researchers to experiment quickly and efficiently – allowing them to move from a basic model to a methodology applicable in the future.
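The article does not detail the exact architecture, so the sketch below shows a generic transfer-learning baseline in PyTorch of the kind such a "bear / no bear" classifier could start from; the dataset path, the backbone, and the image size are assumptions, not the team's actual setup.

```python
import torch
from torch import nn
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed folder layout: camera_traps/{bear,no_bear}/*.jpg
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("camera_traps", transform=transform)
loader = torch.utils.data.DataLoader(train_data, batch_size=16, shuffle=True)

# Start from an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: bear / no bear
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
# Evaluation on held-out night-time and noisy images is the harder part in practice.
```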
Results
The model learned to recognize the basic features of the bear and achieved high accuracy during training (>90%).
In real conditions (night shots, noise, camera movement), however, the accuracy is still insufficient for deployment in the field.
Impact and future: Although the results are not yet perfect, the research shows that artificial intelligence has great potential in nature conservation. In the future, automatic analysis of camera traps could:
help monitor the population size and movement of bears
reduce the risk of conflicts with humans
save researchers hundreds of hours of manual work
The next step is to expand the dataset and use synthetic data – computer-generated images that will enrich the training database. Here too, the supercomputer will be crucial, since generating and processing such data is again highly demanding.
Thanks to the supercomputer, Slovak researchers managed to build the first step towards a system that could, in the future, facilitate the monitoring of the brown bear – a species that is part of Slovakia’s natural environment and cultural heritage.
Intent Classification for Bank Chatbots through LLM Fine-Tuning
This study evaluates the application of large language models (LLMs) for intent classification within a chatbot with predetermined responses designed for banking industry websites. Specifically, the research examines the effectiveness of fine-tuning SlovakBERT compared to employing multilingual generative models, such as Llama3 8b instruct and Gemma 7b instruct, in both their pre-trained and fine-tuned versions. The findings indicate that SlovakBERT outperforms the other models in terms of in-scope accuracy and out-of-scope false positive rate, establishing it as the benchmark for this application.
The advent of digital technologies has significantly influenced customer service methodologies, with a notable shift towards integrating chatbots for handling customer support inquiries. This trend is primarily observed on business websites, where chatbots serve to facilitate customer queries pertinent to the business’s domain. These virtual assistants are instrumental in providing essential information to customers, thereby reducing the workload traditionally managed by human customer support agents.
In the realm of chatbot development, recent years have witnessed a surge in the employment of generative artificial intelligence technologies to craft customized responses. Despite this technological advancement, certain enterprises continue to favor a more structured approach to chatbot interactions. In this perspective, the content of responses is predetermined rather than generated on-the-fly, ensuring accuracy of information and adherence to the business’s branding style. The deployment of these chatbots typically involves defining specific classifications known as intents. Each intent correlates with a particular customer inquiry, guiding the chatbot to deliver an appropriate response. Consequently, a pivotal challenge within this system lies in accurately identifying the user’s intent based on their textual input to the chatbot.
Problem Description and Our Approach
This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and nettle, s.r.o., a Slovakia-based start-up focusing on natural language processing, chatbots, and voicebots. HPC resources of the Devana system were utilized to handle the extensive computations required for fine-tuning LLMs. The goal is to develop a chatbot designed for an online banking service.
In frameworks as described in the introduction, a predetermined precise response is usually preferred over a generated one. Therefore, the initial development step is the identification of a domain-specific collection of intents crucial for the chatbot’s operation and the formulation of corresponding responses for each intent. These chatbots are often highly sophisticated, encompassing a broad spectrum of a few hundred distinct intents. For every intent, developers craft various exemplary phrases that they anticipate users would articulate when inquiring about that specific intent. These phrases are pivotal in defining each intent and serve as foundational training material for the intent classification algorithm.
Our baseline proprietary intent classification model, which does not leverage any deep learning framework, achieves a 67% accuracy on a real-world test dataset described in the next section. The aim of this work is to develop an intent classification model using deep learning that will outperform this baseline model.
We present two different approaches for solving this task. The first one explores the application of Bidirectional Encoder Representations from Transformers (BERT), evaluating its effectiveness as the backbone for intent classification and its capacity to power precise response generation in chatbots. The second approach employs generative large language models (LLMs) with prompt engineering to identify the appropriate intent with and without fine-tuning the selected model.
Dataset
Our training dataset consists of pairs (text, intent), wherein each text is an example query posed to the chatbot that triggers the respective intent. This dataset is meticulously curated to cover the entire spectrum of predefined intents, ensuring a sufficient volume of textual examples for each category.
In our study, we have access to a comprehensive set of intents, each accompanied by corresponding user query examples. We consider two sets of training data: a “simple” set, providing 10 to 20 examples for each intent, and a “generated” set, which encompasses 20 to 500 examples per intent, introducing a greater volume of data albeit with increased repetition of phrases within individual intents.
These compilations of data are primed for ingestion by supervised classification models. This process involves translating the set of intents into numerical labels and associating each text example with its corresponding label, followed by the actual model training.
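As an illustration of that step, the short sketch below maps intent names to integer labels; the example pairs are invented and do not come from the proprietary dataset.

```python
# Hypothetical (text, intent) pairs; the real data is proprietary.
examples = [
    ("How do I block my card?", "card_block"),
    ("I forgot my internet banking password.", "password_reset"),
    ("What is the interest rate on the savings account?", "savings_rate"),
]

# Map each intent name to a numeric label for supervised training.
intents = sorted({intent for _, intent in examples})
label2id = {intent: i for i, intent in enumerate(intents)}
id2label = {i: intent for intent, i in label2id.items()}

texts = [text for text, _ in examples]
labels = [label2id[intent] for _, intent in examples]
print(label2id)   # e.g. {'card_block': 0, 'password_reset': 1, 'savings_rate': 2}
```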
Additionally, we utilize a test dataset comprising approximately 300 (text, intent) pairs extracted from an operational deployment of the chatbot, offering an authentic representation of real-world user interactions. All texts within this dataset are tagged with an intent by human annotators. This dataset is used for performance evaluation of our intent classification models by feeding them the text inputs and comparing the predicted intents with those annotated by humans.
All of these datasets are proprietary to nettle, s.r.o., so they cannot be discussed in more detail here.
Evaluation Process
In this article, the models are primarily evaluated based on their in-scope accuracy using a real-world test dataset comprising 300 samples. Each of these samples belongs to the in-scope intents on which the models were trained. Accuracy is calculated as the ratio of correctly classified samples to the total number of samples. For models that also provide a probability output, such as BERT, a sample is considered correctly classified only if its confidence score exceeds a specified threshold. Throughout this article, accuracy refers to this in-scope accuracy.
As a secondary metric, the models are assessed on their out-of-scope false positive rate, where a lower rate is preferable. For this evaluation, we use artificially generated out-of-scope utterances.
The model is expected either to produce a confidence score below the threshold (for BERT) or to generate an 'invalid' label (for the LLMs, as detailed in their respective sections).
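The two metrics can be summarised with a small helper like the one below, a sketch under the assumption that each prediction comes with a confidence score; for the generative models, the 'invalid' label plays the role of a rejection.

```python
def in_scope_accuracy(predictions, gold_labels, threshold=0.5):
    """Share of in-scope samples whose predicted intent is correct AND
    whose confidence exceeds the threshold. predictions: (intent, confidence)."""
    correct = sum(
        1
        for (pred, conf), gold in zip(predictions, gold_labels)
        if pred == gold and conf >= threshold
    )
    return correct / len(gold_labels)


def out_of_scope_false_positive_rate(predictions, threshold=0.5):
    """Share of out-of-scope samples that the model wrongly accepts, i.e.
    answers with some intent at high confidence instead of rejecting."""
    false_positives = sum(
        1 for (pred, conf) in predictions if pred != "invalid" and conf >= threshold
    )
    return false_positives / len(predictions)


# Illustrative values only.
preds_in = [("card_block", 0.91), ("savings_rate", 0.40), ("card_block", 0.88)]
gold_in = ["card_block", "savings_rate", "password_reset"]
preds_oos = [("invalid", 1.0), ("card_block", 0.95)]
print(in_scope_accuracy(preds_in, gold_in))          # 1 of 3 is correct and confident
print(out_of_scope_false_positive_rate(preds_oos))   # 1 of 2 is wrongly accepted
```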
Approach 1: Fine-Tuning SlovakBERT
Since the data at hand is in the Slovak language, the choice of a model with Slovak understanding was inevitable. Therefore, we opted for SlovakBERT [5], the first publicly available large-scale Slovak masked language model.
Multiple experiments were undertaken by fine-tuning this model before arriving at the top-performing model. These trials included adjustments to hyperparameters, various text preprocessing techniques, and, most importantly, the choice of training data.
Given the presence of two training datasets with relevant intents (“simple” and “generated”), experiments with different ratios of samples from these datasets were conducted. The results showed that the optimal performance of the model is achieved when training on the “generated” dataset.
After the optimal dataset was chosen, further experiments were carried out, focusing on selecting the right preprocessing for the dataset. The following options were tested:
turning text to lowercase,
removing diacritics from text, and
removing punctuation from text.
Additionally, combinations of these three options were tested as well. Given that the leveraged SlovakBERT model is case-sensitive and diacritic-sensitive, all of these text transformations impact the overall performance.
Findings from the experiments revealed that the best results are obtained when the text is lowercased and both diacritics and punctuation are removed.
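A minimal version of that winning preprocessing (lowercasing, stripping diacritics, removing punctuation) could look like this; the example query is made up:

```python
import string
import unicodedata

def preprocess(text: str) -> str:
    """Lowercase, strip diacritics, and remove punctuation from a query."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks (diacritics).
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Remove punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

print(preprocess("Aký je zostatok na mojom účte?"))  # -> "aky je zostatok na mojom ucte"
```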
Another aspect investigated during the experimentation phase was the selection of layers for fine-tuning. Options to fine-tune only one quarter, one half, three quarters of the layers, and the whole model were analyzed (with variations including fine-tuning the whole model for the first few epochs and then a selected number of layers further until convergence). The outcome showed that the average improvement achieved by these adjustments to the model’s training process is statistically insignificant. Since there is a desire to keep the pipeline as simple as possible, these alterations did not take place in the final pipeline.
Every experiment trial underwent assessment three to five times to ensure statistical robustness in considering the results.
The best model produced from these experiments had an average accuracy of 77.2% with a standard deviation of 0.012.
Banking-Tailored BERT
Given that our data contains particular banking industry nomenclature, we opted to utilize a BERT model fine-tuned specifically for the banking and finance sector. However, since this model exclusively understands the English language, the data had to be translated accordingly.
For the translation, the DeepL API [1] was employed. Firstly, the training, validation, and test data were translated. Due to the nature of the English language and translation, no further correction (preprocessing) was applied to the text, as discussed in Section 2.3.1. Subsequently, the model’s weights were fine-tuned to enhance performance.
The fine-tuned model demonstrated promising initial results, with accuracy slightly exceeding 70%. Unfortunately, further training and hyperparameter tuning did not yield better results. Other English models were tested as well, but all of them produced similar results. Using a customized English model proved insufficient to achieve superior results, primarily due to translation errors. The translation contained inaccuracies caused by the 'noisiness' of the data, especially within the test dataset.
Approach 2: LLMs for Intent Classification
As mentioned in Section 2, in addition to fine-tuning the SlovakBERT model and other BERT-based models, the use of generative LLMs for intent classification was explored too. Specifically, instruct models were selected for their proficiency in handling instruction prompts and question-answering tasks.
Since there is no open-source instruct model trained exclusively for the Slovak language, several multilingual models were selected: Gemma 7b instruct [6] and Llama3 8b instruct. For comparison, we also include results for the closed-source OpenAI gpt-3.5-turbo model under the same conditions.
Similarly to [4], we use LLM prompts with intent names and descriptions to perform zero-shot prediction. The output is expected to be the correct intent label. Since the full set of intents with their descriptions would inflate the prompt too much, we use our baseline model to select only top 3 intents. Hence the prompt data for these models was created as follows:
Each prompt includes a sentence (user’s question) in Slovak, four intent options with descriptions, and an instruction to select the most appropriate option. The first three intent options are the ones selected by the baseline model, which has a Top-3 recall of 87%. The last option is always ‘invalid’ and should be selected when neither of the first three matches the user’s question or the input intent is out-of-scope. Consequently, the highest attainable in-scope accuracy in this setting is 87%.
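Under those constraints, the prompt for each test sample could be assembled roughly as follows; the wording, intent names, and descriptions are placeholders, since the production prompts are proprietary:

```python
def build_prompt(user_text, top3_intents, descriptions):
    """Compose a zero-shot classification prompt with 3 candidate intents + 'invalid'."""
    options = [
        f"{i + 1}. {name}: {descriptions[name]}" for i, name in enumerate(top3_intents)
    ]
    options.append("4. invalid: none of the above options fits the question")
    return (
        "You are an intent classifier for a banking chatbot.\n"
        f"User question (in Slovak): {user_text}\n"
        "Options:\n" + "\n".join(options) + "\n"
        "Answer with the name of the single most appropriate option."
    )

# Illustrative example with made-up intents selected by the baseline model.
descriptions = {
    "card_block": "the user wants to block a payment card",
    "password_reset": "the user cannot log in to internet banking",
    "branch_hours": "the user asks about branch opening hours",
}
print(build_prompt("Ako si zablokujem kartu?", list(descriptions), descriptions))
```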
Pre-trained LLM Implementation
Initially, a pre-trained LLM implementation was utilized, meaning a given instruct model was leveraged without fine-tuning on our dataset. A prompt was passed to the model in the user’s role, and the model generated an assistant’s response.
To improve the results, prompt engineering was employed too. It included subtle rephrasing of the instruction; instructing the model to answer only with the intent name, or with the number/letter of the correct option; or placing the instruction in the system’s role while the sentence and options were in the user’s role.
Despite these efforts, this approach did not yield better results than SlovakBERT’s fine-tuning. However, it helped us identify the most effective prompt formats for fine-tuning of these instruct models. Also, these steps were crucial in understanding the models’ behaviour and response pattern, which we leveraged in fine-tuning strategies of these models.
LLM Optimization through Fine-Tuning
The prompts that the pre-trained models reacted best to were used for fine-tuning of these models. Given that LLMs do not require extensive fine-tuning datasets, we utilized our “simple” dataset as detailed in Section 2.1. The model was then fine-tuned to respond to the specified prompts with the appropriate label names.
Due to the size of the chosen models, a parameter-efficient fine-tuning (PEFT) [2] strategy was employed to handle memory and time constraints. PEFT updates only a subset of parameters while "freezing" the rest, thereby reducing the number of trainable parameters. Specifically, the Low-Rank Adaptation (LoRA) [3] approach was used.
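As a rough illustration of the LoRA setup, an adapter can be attached to a causal language model with the Hugging Face peft library roughly as follows. This is a sketch under assumed hyperparameter values and a placeholder model identifier, not the exact configuration used in this work.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-7b-it"  # placeholder identifier for an instruct model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor, tuned alongside the learning rate
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```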
To optimize performance, various hyperparameters were tuned as well, including the learning rate, batch size, the lora_alpha parameter of the LoRA configuration, the number of gradient accumulation steps, and the chat template formulation.
Optimizing language models involves high computational demands, necessitating the use of HPC resources to achieve the desired performance and efficiency. The Devana system, with each node containing 4 NVIDIA A100 GPUs with 40 GB of memory each, offers significant computational power. In our case, both models we fine-tune fit within the memory of a single GPU (full size, not quantized) with a maximum batch size of 2.
Although leveraging all 4 GPUs in a node would reduce training time and allow for a larger overall batch size (while maintaining the same batch size per device), for benchmarking purposes and to guarantee consistency and comparability of the results, we conducted all experiments using 1 GPU only.
These efforts led to some improvements in the models' performance, particularly for Gemma 7b instruct in reducing the number of false positives. While fine-tuning Llama3 8b instruct, both metrics (accuracy and the number of false positives) improved. However, neither Gemma 7b instruct nor Llama3 8b instruct outperformed the fine-tuned SlovakBERT model.
With Gemma 7b instruct, some sets of hyperparameters resulted in high accuracy but also a high false positive rate, while others led to lower accuracy and a low false positive rate. Finding a set of hyperparameters that balanced accuracy and false positive rate was challenging. The best-performing configuration achieved an accuracy slightly over 70% with a false positive rate of 4.6%. Compared to the model's performance without fine-tuning, fine-tuning only slightly increased the accuracy but dramatically reduced the false positive rate, by almost 70%.
With Llama3 8b instruct, the best-performing configuration achieved an accuracy of 75.1% with a false positive rate of 7.0%. Compared to the model’s performance without fine-tuning, fine-tuning significantly increased the accuracy and also halved the false positive rate.
Comparison with a Closed-Source Model
To benchmark our approach against a leading closed-source LLM, we conducted experiments using OpenAI's gpt-3.5-turbo. We employed identical prompt data for a fair comparison and tested both the pre-trained and fine-tuned versions of this model. Without fine-tuning, gpt-3.5-turbo achieved an accuracy of 76%, although it exhibited a considerable false positive rate. After fine-tuning, the accuracy improved to almost 80%, and the false positive rate was considerably reduced.
Results
In our initial strategy, fine-tuning the SlovakBERT model for our task, we achieved an average accuracy of 77.2% with a standard deviation of 0.012, representing an increase of 10% over the baseline model's accuracy.
Fine-tuning the banking-tailored BERT on the translated dataset yielded a final accuracy slightly under 70%, which outperforms the baseline model but does not surpass the fine-tuned SlovakBERT model.
Subsequently, we experimented with pre-trained (but not fine-tuned on our data) generative LLMs for our task. While these models showed promising capabilities, their performance was inferior to that of the SlovakBERT fine-tuned for our specific task. Therefore, we proceeded to fine-tune these models, namely Gemma 7b instruct and Llama3 8b instruct.
The fine-tuned Gemma 7b instruct model demonstrated a final accuracy comparable to the banking-tailored BERT, while the fine-tuned Llama3 8b instruct performed slightly worse than the fine-tuned SlovakBERT. Despite extensive efforts to find a configuration surpassing the capabilities of the SlovakBERT model, these attempts were unsuccessful, establishing the SlovakBERT model as our benchmark for performance.
All results are displayed in Table 1, including the baseline proprietary model and a closed-source model for comparison.
Table 1: Percentage comparison of models’ in-scope accuracy and out-of-scope false positive rate.
Conclusion
The goal of this study was to find an approach leveraging a pre-trained language model (fine-tuned or not) as a backbone for a chatbot for the banking industry. The data provided for the study consisted of pairs of text and intent, where the text represents the user's (customer's) query and the intent represents the intent to be triggered.
Several language models were experimented with, including SlovakBERT, a banking-tailored BERT, and the generative models Gemma 7b instruct and Llama3 8b instruct. After experimenting with the dataset, fine-tuning configurations, and prompt engineering, fine-tuning SlovakBERT emerged as the best approach, yielding a final accuracy slightly above 77%. This represents a 10% increase over the baseline model's accuracy, demonstrating its suitability for our task.
In conclusion, our study highlights the efficacy of fine-tuning pre-trained language models for developing a robust chatbot with accurate intent classification. Moving forward, leveraging these insights will be crucial for further enhancing performance and usability in real-world banking applications.
Research results were obtained with the support of the Slovak National competence centre for HPC, the EuroCC 2 project and Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.
References:
[1] AI@Meta. Llama 3 model card. 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[2] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey, 2024. arXiv:2403.14608.
[3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021. URL: https://arxiv.org/abs/2106.09685, arXiv:2106.09685.
[4] Soham Parikh, Quaizar Vohra, Prashil Tumbade, and Mitul Tiwari. Exploring zero and fewshot techniques for intent classification, 2023. URL: https://arxiv.org/abs/2305.07157, arXiv:2305.07157.
[5] Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik. Slovakbert: Slovak masked language model. CoRR, abs/2109.15254, 2021. URL: https://arxiv.org/abs/2109.15254, arXiv:2109.15254.
[6] Gemma Team, Thomas Mesnard, and Cassidy Hardin et al. Gemma: Open models based on gemini research and technology, 2024. arXiv:2403.08295.
Authors
Bibiána Lajčinová – Slovak National Supercomputing Centre; Patrik Valábek – Slovak National Supercomputing Centre, Institute of Information Engineering, Automation, and Mathematics, Slovak University of Technology in Bratislava; Michal Spišiak – nettle, s.r.o., Bratislava, Slovakia
Leveraging LLMs for Efficient Religious Text Analysis
The analysis and research of texts with religious themes have historically been the domain of philosophers, theologians, and other social sciences specialists. With the advent of artificial intelligence, such as the large language models (LLMs), this task takes on new dimensions. These technologies can be leveraged to reveal various insights and nuances contained in religious texts — interpreting their symbolism and uncovering their meanings. This acceleration of the analytical process allows researchers to focus on specific aspects of texts relevant to their studies.
One possible research task in the study of texts with religious themes involves examining the works of authors affiliated with specific religious communities. By comparing their writings with the official doctrines and teachings of their denominations, researchers can gain deeper insights into the beliefs, convictions, and viewpoints of the communities shaped by the teachings and unique contributions of these influential authors.
This report proposes an approach utilizing embedding indices and LLMs for efficient analysis of texts with religious themes. The primary objective is to develop a tool for information retrieval, specifically designed to efficiently locate relevant sections within documents. The identification of discrepancies between the retrieved sections of texts from specific religious communities and the official teaching of the particular religion the community originates from is not part of this study; this task is entrusted to theological experts.
This work is a joint effort of Slovak National Competence Center for High-Performance Computing and the Faculty of Theology at Trnava University. Our goal is to develop a tool for information retrieval using LLMs to help theologians analyze religious texts more efficiently. To achieve this, we are leveraging resources of HPC system Devana to handle the computations and large datasets involved in this project.
Dataset
The texts used for the research in this study originate from the religious community known as the Nazareth Movement (commonly referred to as ”Beňovci”), which began to form in the 1970s. The movement, which some scholars identify as having sect-like characteristics, is still active today, in reduced and changed form. Its founder, Ján Augustín Beňo (1921 - 2006), was a secretly ordained Catholic priest during the totalitarian era. Beňo encouraged members of the movement to actively live their faith through daily reading of biblical texts and applying them in practice through specific resolutions. The movement spread throughout Slovakia, with small communities existing in almost every major city. It also spread to neighboring countries such as Poland, the Czech Republic, Ukraine, and Hungary. In 2000, the movement included approximately three hundred married couples, a thousand children, and 130 priests and students preparing for priesthood. The movement had three main goals: radical prevention in education, fostering priests who could act as parental figures to identify and nurture priestly vocations in children, and the production and distribution of samizdat materials needed for catechesis and evangelization.
27 documents with texts from this community are available for research. These documents, which significantly influenced the formation of the community and its ideological positions, were reproduced and distributed during the communist regime in the form of samizdats — literature banned by the communist regime. After the political upheaval, many of them were printed and distributed to the public outside the movement. Most of the analyzed documents consist of texts intended for ”morning reflections” — short meditations on biblical texts. The documents also include the founder’s comments on the teachings of the Catholic Church and selected topics related to child rearing, spiritual guidance, and catechesis for children.
Although the available documents contained a few duplicates, this did not pose a problem for the information retrieval task and is therefore not addressed further in this report. All of the documents are written exclusively in the Slovak language.
One of the documents was annotated for test purposes by experts from the partner faculty, who have long studied the Nazareth Movement. By annotations we refer to text parts labeled as belonging to one of five classes, where these classes represent five topics, namely:
Directive obedience
Hierarchical upbringing
Radical adoption of life model
Human needs fulfilled only in religious community and family
Strange/Unusual/Intense
Additionally, each of these topics is supplemented with a set of queries designed to test the retrieval capabilities of our solution.
Strategy/Solution
There are multiple strategies appropriate for this task, including text classification, topic modelling, retrieval-augmented generation (RAG), and fine-tuning of LLMs. However, the theologians require the identification of specific parts of the text for detailed analysis, which necessitates retrieving the exact wording. Therefore, we chose to use information retrieval. This approach differs from RAG, which typically combines information retrieval with text generation, by focusing solely on retrieving textual data, without the additional step of generating new content.
Information retrieval leverages LLMs to transform complex data, such as text, into a numerical representation that captures the semantic meaning and context of the input. This numerical representation, known as an embedding, can be used to conduct semantic searches by analysing the positions and proximity of embeddings within a multi-dimensional vector space. Given a query, the system retrieves relevant parts of the text by measuring the similarity between the query embedding and the text embeddings. This approach does not require any fine-tuning of existing LLMs, so the models can be used without modification and the workflow remains quite simple.
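The retrieval step itself then reduces to a nearest-neighbour search over the embedding vectors. A minimal numpy sketch, assuming the embeddings have already been computed and L2-normalised, could look like this:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return indices of the k chunks most similar to the query.

    `chunk_vecs` has shape (n_chunks, dim); both query and chunks are assumed
    to be L2-normalised, so the dot product equals cosine similarity.
    """
    scores = chunk_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Hypothetical usage with random stand-in embeddings
rng = np.random.default_rng(0)
chunks = rng.normal(size=(1000, 384))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = chunks[42] + 0.1 * rng.normal(size=384)
query /= np.linalg.norm(query)
idx, sims = top_k_chunks(query, chunks)
print(idx, sims)
```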
Model choice
Four pre-trained models were considered for generating the embeddings: Slovak-BERT [1], OpenAI's text-embedding-3-small, the BGE M3 embedding model [2], and the multilingual E5 model [3]. These four models were leveraged to acquire vector representations of the chunked text, and their specific contributions are discussed in the following parts of the study.
Data preprocessing
The first step of data preprocessing involved text chunking. The primary reason for this step was to meet the religious scholars' requirement for the retrieval of paragraph-sized chunks. Moreover, the documents needed to be split into smaller chunks anyway due to the limited input lengths of some LLMs. For this purpose, the LangChain library was utilized. It offers hierarchical chunking that produces overlapping chunks of a specific length (with a desired overlap) to ensure that context is preserved; a sketch of this step is shown below. Chunks with lengths of 300, 400, 500 and 700 symbols were generated. Subsequent preprocessing steps included the removal of diacritics, case normalization according to the requirements of the models, and stopword removal. The removal of stopwords is a common practice in natural language processing: some models may benefit from excluding stopwords to improve the relevancy of retrieved chunks, while others may take advantage of retaining them to preserve contextual information essential for understanding the text.
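A minimal sketch of the chunking step with LangChain's recursive splitter follows; the overlap value, separator list, and file name are assumptions of this illustration rather than the project's exact settings.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text: str, chunk_size: int, overlap: int = 50) -> list[str]:
    """Split a document into overlapping, roughly paragraph-sized chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,                 # e.g. 300, 400, 500 or 700 symbols
        chunk_overlap=overlap,                 # overlap preserves context across boundaries
        separators=["\n\n", "\n", ". ", " "],  # split on larger units first
    )
    return splitter.split_text(text)

# Hypothetical usage: produce the four chunk-size variants of one document
with open("document_01.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()
variants = {size: chunk_document(text, size) for size in (300, 400, 500, 700)}
```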
Vector Embeddings
Vector embeddings were created from text chunks using selected pre-trained language models.
For the Slovak-BERT model, generating an embedding involves running the model without any additional layers and using the embedding of the first token, which aggregates the semantic content of the chunk, as the context embedding. The other models produce embeddings in the required form, so no further postprocessing was needed.
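For illustration, the Slovak-BERT chunk embedding described above can be obtained with the transformers library roughly as follows; the checkpoint name is assumed to be the publicly available Slovak-BERT model, and the snippet is a sketch rather than the project's actual code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "gerulata/slovakbert"  # assumed public Slovak-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def embed_chunk(text: str) -> torch.Tensor:
    """Return the first-token (context) embedding of a text chunk."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs)
    # the first position of the last hidden layer serves as the chunk embedding
    return outputs.last_hidden_state[0, 0]

vec = embed_chunk("Ranné zamyslenie nad biblickým textom.")
print(vec.shape)  # e.g. torch.Size([768])
```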
In the subsequent results section, the performance of all created embedding models will be analyzed and compared based on their ability to capture and represent the semantic content of the text chunks.
Results
Prior to conducting quantitative tests, all embedding indices underwent preliminary evaluation to determine the level of understanding of the Slovak language and the specific religious terminology by the selected LLMs. This preliminary evaluation involved subjective judgement of the relevance of retrieved chunks.
These tests revealed that the E5 model embeddings exhibit limited effectiveness on our data. When retrieving for a specific query, the retrieved chunks contained most of the key words used in the query but not the query's context. One possible explanation is that this model prioritizes word-level matches over nuanced context in Slovak, perhaps because its Slovak training data was less extensive or less contextually rich, leading to weaker performance. These observations are not definitive conclusions, but rather hypotheses based on the current, limited results. We decided not to evaluate the E5-based embedding indices further, given their inability to capture the nuances of the religious texts. On the other hand, the abilities of the Slovak-BERT model, based on the relatively simple RoBERTa architecture, exceeded expectations. The performance of the text-embedding-3-small and BGE M3 embeddings met expectations: the first, subjectively evaluated test demonstrated a very good grasp of context, proficiency in Slovak, and an understanding of the nuances within the religious texts.
Therefore, quantitative tests were performed only on embedding indices utilizing Slovak-BERT, OpenAI’s text-embedding-3-small and BGE M3 embeddings.
Given the problem specification and the nature of test annotations, there arises a potential concern regarding the quality of the annotations. It is possible that some text parts were misclassified as there may be sections of text that belong to multiple classes. This, combined with the possibility of human error, can affect the consistency and accuracy of the annotations.
With this consideration in mind, we have opted to focus solely on recall evaluation. By recall, we mean the proportion of correctly retrieved chunks out of the total number of annotated chunks, regardless of the fraction of false positive chunks. Recall will be evaluated for every topic and for every length-specific embedding index for all selected LLMs.
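Under this definition, recall for a single topic can be computed directly from the sets of retrieved and annotated chunk identifiers, as in the following minimal sketch (the identifiers are illustrative):

```python
def recall(retrieved_ids: set[int], annotated_ids: set[int]) -> float:
    """Fraction of annotated chunks that appear among the retrieved chunks."""
    if not annotated_ids:
        return 0.0
    return len(retrieved_ids & annotated_ids) / len(annotated_ids)

# Hypothetical example: 3 of 4 annotated chunks were retrieved
print(recall({1, 5, 9, 12, 20}, {1, 5, 9, 30}))  # 0.75
```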
Moreover, the provided test queries might also reflect the complexity and interpretative nature of religious studies. For example, consider the query "God's will" for the topic Directive obedience. While a careful reader understands how this query relates to the given topic, it might not be as clear to a language model. Therefore, apart from evaluating with the provided queries, another evaluation was conducted using queries obtained through contextual augmentation. Contextual/query augmentation is a prompt engineering technique for enhancing text data quality that is well documented in the research literature. It involves prompting a language model to generate a new query based on the initial query and other contextual information in order to formulate a better query. The language model used to generate queries via this augmentation technique was GPT-3.5, and these queries are referred to as "GPT queries" throughout the rest of the report.
Slovak-BERT embedding indices
Recall evaluation for embedding indices utilizing Slovak-BERT embeddings, for four different chunk sizes with and without stopword removal, is presented in Figure 1. The evaluation covers each topic listed in the Dataset section and includes both the original queries and the GPT queries.
We observe that GPT queries generally yield better results compared to the original queries, except for the last two topics, where both sets of queries produce similar results. It is also apparent that Slovak-BERT-based embeddings benefit from stopword removal in most cases. The highest recall values were achieved for the third topic, Radical adoption of life model, with a chunk size of 700 symbols and stopwords removed, reaching more than 47%. In contrast, the worst results were observed for the topic Strange/Unusual/Intense, where neither the original nor the GPT queries successfully retrieved relevant parts; in some cases none of the relevant parts were retrieved at all.
Figure 1: Recall values obtained for all topics using both original and GPT queries, across various chunk sizes of embeddings generated using the Slovak-BERT model. Embedding indices marked as +SW include stopwords, while -NoSW indicates stopwords were removed.
OpenAI’s text-embedding-3-small embedding indices
Analogously to the Slovak-BERT evaluation, the charts for embedding indices utilizing OpenAI's text-embedding-3-small embeddings are presented in Figure 2. The recall values are generally much higher than those observed with Slovak-BERT embeddings. As with the previous results, GPT queries produce better outcomes. We can also observe a subtle trend in the dependence of recall on chunk size: longer chunks generally yield higher recall values.
An interesting observation can be made for the topic Radical adoption of life model. When using the original queries, hardly any relevant results were retrieved. However, when using GPT queries, recall values were much higher, reaching almost 90% for chunk sizes of 700 symbols.
Regarding the removal of stopwords, its impact on embeddings varies. For topics 4 and 5, stopwords removal proves beneficial. However, for the other topics, this preprocessing step does not offer advantages.
Topics 4 and 5 exhibited the weakest performance of all topics. This may be due to the nature of the queries provided for these topics, which are quotes or full sentences, whereas the queries for the other topics are phrases, keywords, or short expressions; the model appears to perform better with the latter type. On the other hand, since the queries for topics 4 and 5 are full sentences, their embeddings benefit from stopword removal, which probably helps in handling the context of sentence-like queries.
Topic 4 is very specific and abstract, while topic 5 is very general, making it understandable that capturing this topic in queries is challenging. The specificity of topic 4 might require more nuanced test queries, as the provided test queries probably did not contain all nuances of a given topic. Conversely, the general nature of topic 5 might benefit from a different analytical approach. Methods like Sentiment Analysis could potentially grasp the strange, unusual, or intense mood in relation to the religious themes analysed.
Figure 2: Recall values assessed for all topics using both original and GPT queries, utilizing various chunk sizes of embeddings generated with the text-embedding-3-small model. Embedding indices labeled +SW include stopwords, and those labeled -NoSW have stopwords removed.
BGE M3 embedding indices
Evaluation charts for embedding indices utilizing BGE M3 embeddings are presented in Figure 3. The recall values fall between those of Slovak-BERT and OpenAI's text-embedding-3-small embeddings. While not always reaching the recall values of OpenAI's embeddings, BGE M3 embeddings show competitive performance, particularly considering their open-source availability; OpenAI's embeddings are accessible only through an API, which may raise data-confidentiality concerns.
With these embeddings, we also observe the same phenomenon as with OpenAI's text-embedding-3-small embeddings: shorter, phrase-like queries are preferred over quote-like queries. Recall values are therefore higher for the first three topics.
Stopword removal seems to be mostly beneficial, mainly for the last two topics.
Figure 3: Recall values for all topics using original and GPT queries, with embeddings of different chunk sizes produced by the BGE M3 model. Indices labeled as +SW contain stopwords, while -NoSW indicates their removal.
Conclusion
This paper presents an approach to the analysis of texts with religious themes using numerical text representations known as embeddings, generated by three selected pre-trained language models: Slovak-BERT, OpenAI's text-embedding-3-small, and the BGE M3 embedding model. These models were selected after an evaluation showed that their proficiency in the Slovak language and in religious terminology is sufficient to handle the task of information retrieval for the given set of documents.
Challenges related to quality of test queries were addressed using query augmentation technique. This approach helped in formulating appropriate queries, resulting in more relevant retrieval of text chunks, capturing all the nuances of topics that interest theologians.
The evaluation results demonstrated the effectiveness of the embeddings produced by these models, particularly text-embedding-3-small from OpenAI, which exhibited strong contextual understanding and linguistic proficiency. The recall values for this model varied depending on the topic and queries used, with the highest values reaching almost 90% for the topic Radical adoption of life model when using GPT queries and a chunk length of 700 symbols. Generally, text-embedding-3-small performed best with the longest chunk lengths studied, showing a trend of increasing recall with increasing chunk length. The topic Strange/Unusual/Intense had the lowest recall, possibly due to the uncertainty in its specification.
For Slovak-BERT embedding indices, the recall values were slightly lower but still impressive given the simplicity of this language model. Better results were achieved using GPT queries, with the best recall value of 47.1% for the topic Radical adoption of life model at a chunk length of 700 symbols, with embeddings created from chunks with stopwords removed. Overall, this embedding model benefited the most from the stopword-removal preprocessing step.
As for BGE M3 embeddings, the results were impressive, achieving high recall, though not as high as OpenAI's embeddings. Considering that BGE M3 is an open-source model, these results are remarkable.
These findings highlight the potential of leveraging LLMs for specialized domains such as the analysis of texts with religious themes. Future work could explore the connections between text chunks using clustering techniques on the embeddings to discover hidden associations and inspirations of the texts' authors. For the theologians, future work lies in examining the retrieved text parts to identify deviations from the official teaching of the Catholic Church, shedding light on the movement's interpretations and insights.
Acknowledgment
Research results were obtained with the support of the Slovak National competence centre for HPC, the EuroCC 2 project and Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.
Computational resources were procured in the national project National competence centre for high performance computing (project code: 311070AKF2) funded by European Regional Development Fund, EU Structural Funds Informatization of society, Operational Program Integrated Infrastructure.
Authors
Bibiána Lajčinová – Slovak National Supercomputing Centre; Jozef Žuffa – Faculty of Theology, Trnava University; Milan Urbančok – Faculty of Theology, Trnava University
References:
[1] Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšťák, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik. Slovakbert: Slovak masked language model, 2021.
[2] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[3] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multi-lingual e5 text embeddings: A technical report, 2024.
Mapping Tree Positions and Heights Using PointCloud Data Obtained Using LiDAR Technology
The goal of the collaboration between the Slovak National Supercomputing Centre (NSCC) and the company SKYMOVE within the National Competence Center for HPC project was to design and implement a pilot software solution for processing data obtained using LiDAR (Light Detection and Ranging) technology mounted on drones.
Data collection
LiDAR is an innovative method of remote distance measurement that is based on measuring the travel time of laser pulse reflections from objects. LiDAR emits light pulses that hit the ground or object and return to the sensors. By measuring the return time of the light, LiDAR determines the distance to the point where the laser beam was reflected.
LiDAR can emit 100k to 300k pulses per second, capturing dozens to hundreds of pulses per square meter of the surface, depending on specific settings and the distance to the scanned object. This process creates a point cloud (PointCloud) consisting of potentially millions of points. Modern LiDAR use involves data collection from the air, where the device is mounted on a drone, increasing the efficiency and accuracy of data collection. In this project, drones from DJI, particularly the DJI M300 and Mavic 3 Enterprise (Fig. 1), were used for data collection. The DJI M300 is a professional drone designed for various industrial applications, and its parameters make it suitable for carrying LiDAR.
The DJI M300 drone was used as a carrier for the Geosun LiDAR (Fig. 1). This is a mid-range, compact system with an integrated laser scanner and a positioning and orientation system. Given the balance between data collection speed and data quality, the data was scanned from a height of 100 meters above the surface, allowing larger areas to be scanned in a relatively short time with sufficient quality.
The collected data was geolocated in the S-JTSK coordinate system (EPSG:5514) and the Baltic Height System after adjustment (Bpv), with coordinates given in meters or meters above sea level. In addition to LiDAR data, aerial photogrammetry was performed simultaneously, allowing for the creation of orthophotomosaics. Orthophotomosaics provide a photographic record of the surveyed area in high resolution (3 cm/pixel) with positional accuracy up to 5 cm. The orthophotomosaic was used as a basis for visual verification of the positions of individual trees.
Figure 1. DJI M300 Drone (left) and Geosun LiDAR (right).
Data classification
The primary dataset used for the automatic identification of trees was a LiDAR point cloud in LAS/LAZ format (uncompressed and compressed form). LAS files are a standardized format for storing LiDAR data, designed to ensure efficient storage of large amounts of point data with precise 3D coordinates. LAS files contain information about position (x, y, z), reflection intensity, point classification, and other attributes necessary for LiDAR data analysis and processing. Due to their standardization and compactness, LAS files are widely used in geodesy, cartography, forestry, urban planning, and many other fields requiring detailed and accurate 3D representations of terrain and objects.
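For reference, a LAS/LAZ file of this kind can be inspected in Python with the laspy library; the snippet below only illustrates the stored attributes, and the file name is a placeholder.

```python
import numpy as np
import laspy

las = laspy.read("area_300x300.las")  # placeholder file name

# 3D coordinates and per-point attributes stored in the LAS format
xyz = np.vstack((las.x, las.y, las.z)).T
classification = np.asarray(las.classification)  # e.g. 2 = ground, 5 = high vegetation
intensity = np.asarray(las.intensity)

# keep only points classified as high vegetation for tree analysis
vegetation = xyz[classification == 5]
print(xyz.shape, vegetation.shape)
```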
The point cloud needed to be processed into a form that would allow for an easy identification of individual tree or vegetation points. This process involves assigning a specific class to each point in the point cloud, known as classification.
Various tools can be used for point cloud classification. Given our positive experience, we decided to use the Lidar360 software from GreenValley International [1]. In the point cloud classification, the individual points were classified into the following categories: unclassified (1), ground (2), medium vegetation (4), high vegetation (5), buildings (6). A machine learning method was used for classification, which, after being trained on a representative training sample, can automatically classify points of any input dataset (Fig. 2).
The training sample was created by manually classifying points in the point cloud into the respective categories. For the purposes of automated tree identification in this project, the ground and high vegetation categories are essential. However, for the best classification results of high vegetation, it is also advisable to include other classification categories. The training sample was composed of multiple smaller areas from the entire region including all types of vegetation, both deciduous and coniferous, as well as various types of buildings. Based on the created training sample, the remaining points of the point cloud were automatically classified. It should be noted that the quality of the training sample significantly affects the final classification of the entire area.
Figure 2. Example of a point cloud of an area colored using an orthophotomosaic (left) and the corresponding classification (right) in CloudCompare.
Data segmentation
In the next step, the classified point cloud was segmented using the CloudCompare software [2]. Segmentation generally means dividing classified data into smaller units – segments that share common characteristics. The goal of segmenting high vegetation was to assign individual points to specific trees.
For tree segmentation, the TreeIso plugin in the CloudCompare software package was used, which automatically recognizes trees based on various height and positional criteria (Fig. 3). The overall segmentation consists of three steps:
Grouping points that are close together into segments and removing noise.
Merging neighboring point segments into larger units.
Composing individual segments into a whole that forms a single tree.
The result is a complete segmentation of high vegetation. These segments are then saved into individual LAS files and used for further processing to determine the positions of individual trees. A significant drawback of this tool is that it operates only in serial mode, meaning it can utilize only one CPU core, which greatly limits its use in an HPC environment.
Figure 3. Segmented point cloud in CloudCompare using the TreeIso plugin module.
As an alternative segmentation method, we explored the use of orthophotomosaics of the studied areas. Using machine learning methods, we attempted to identify individual tree crowns in the images and, based on the determined geolocation coordinates, identify the corresponding segments in the LAS file. For detecting tree crowns in the orthophotomosaic, the YOLOv5 model [3] with pretrained weights from the COCO128 database [4] was used. The training data consisted of 230 images manually annotated using the LabelImg tool [5]. The training run consisted of 300 epochs, with images divided into batches of 16 samples and their size set to 1000x1000 pixels, which proved to be a suitable compromise between computational demands and the number of trees per section. The insufficient quality of this approach was particularly evident in areas with dense vegetation (forested areas), as shown in Figure 4. We believe this was due to the insufficient robustness of the chosen training set, which could not adequately cover the diversity of the image data (especially across different vegetative periods). For these reasons, we did not develop segmentation from photographic data further and focused solely on segmentation in the point cloud.
Figure 4. Tree segmentation in the orthophotomosaic using the YOLOv5 tool. The image illustrates the problem of detecting individual trees in the case of dense vegetation (continuous canopy).
To fully utilize the capabilities of the Devana supercomputer, we deployed the lidR library [6] in its environment. This library, written in R, is a specialized tool for processing and analyzing LiDAR data, providing an extensive set of functions and tools for reading, manipulating, visualizing, and analyzing LAS files. With lidR, tasks such as filtering, classification, segmentation, and object extraction from point clouds can be performed efficiently. The library also allows for surface interpolation, creating digital terrain models (DTM) and digital surface models (DSM), and calculating various metrics for vegetation and landscape structure. Due to its flexibility and performance, lidR is a popular tool in geoinformatics and is also suitable for HPC environments, as most of its functions and algorithms are fully parallelized within a single compute node, allowing for full utilization of available hardware. When processing large datasets where the performance or capacity of a single compute node is insufficient, splitting the dataset into smaller parts and processing them independently can leverage multiple HPC nodes simultaneously.
The lidR library includes the locate_trees() function, which can reliably identify tree positions. Based on selected parameters and algorithms, the function analyzes the point cloud and identifies tree locations. In our case, the lmf algorithm, based on maximum height localization, was used [7]. The algorithm is fully parallelized, enabling efficient processing of relatively large areas in a short time.
The identified tree positions can then be used in the silva2016 algorithm for segmentation with the segment_trees() function [8]. This function segments the identified trees into separate LAS files (Fig. 5), similar to the TreeIso plugin module in CloudCompare. These segmented trees in LAS files are then used for further processing, such as determining the positions of individual trees using the DBSCAN clustering algorithm [9].
Figure 5. Tree positions determined using the lmf algorithm (left, red dots) and corresponding tree segments identified by the silva2016 algorithm (right) using the lidR library.
Detection of tree trunks using the DBSCAN clustering algorithm
To determine the position and height of trees in the individual LAS files obtained from segmentation, we used several approaches. The height of each tree was obtained from the z-coordinates in each LAS file as the difference between the maximum and minimum z-coordinates of its point cloud. Since some point cloud segments contained more than one tree, it was necessary to identify the number of tree trunks within these segments.
Tree trunks were identified using the DBSCAN clustering algorithm with the following settings: maximum distance between two points within one cluster (= 1 meter) and minimum number of points in one cluster (= 10). The position of each identified trunk was then obtained based on the x and y coordinates of the cluster centroids. The identification of clusters using the DBSCAN algorithm is illustrated in Figure 6.
Figure 6. Segments of the point cloud, PointCloud (left column), and the corresponding detected clusters at heights of 1-5 meters (right column).
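A minimal sketch of this trunk-detection step with scikit-learn is shown below. It assumes one segmented tree already loaded as an array of points with heights relative to the ground, and it restricts clustering to the 1-5 m slice shown in Figure 6; the slice bounds and helper functions are assumptions of this sketch rather than a verbatim reproduction of the project code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_trunks(points: np.ndarray) -> list[tuple[float, float]]:
    """Return (x, y) centroids of trunk clusters found in one segment.

    `points` is an (n, 3) array of x, y, z coordinates with z given
    relative to the ground.
    """
    # restrict to a height slice where trunks are well separated from the canopy
    band = points[(points[:, 2] >= 1.0) & (points[:, 2] <= 5.0)]
    if len(band) == 0:
        return []
    labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(band[:, :2])
    trunks = []
    for label in set(labels) - {-1}:          # -1 marks noise points
        cluster = band[labels == label]
        trunks.append((cluster[:, 0].mean(), cluster[:, 1].mean()))
    return trunks

def tree_height(points: np.ndarray) -> float:
    """Tree height of the whole segment as the z-range of its points."""
    return float(points[:, 2].max() - points[:, 2].min())
```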
Determining tree heights using surface interpolation
As an alternative method for determining tree heights, we used the Canopy Height Model (CHM). CHM is a digital model that represents the height of the tree canopy above the terrain. This model is used to calculate the height of trees in forests or other vegetative areas. CHM is created by subtracting the Digital Terrain Model (DTM) from the Digital Surface Model (DSM). The result is a point cloud, or raster, that shows the height of trees above the terrain surface (Fig. 7).
If the coordinates of a tree's position are known, we can easily determine the corresponding height of the tree at that point using this model. The model can be computed with the lidR library using the grid_terrain() function, which creates the DTM, and the grid_canopy() function, which calculates the DSM.
Figure 7. Canopy Height Model (CHM) for the studied area (coordinates in meters on the X and Y axes), with the height of each point in meters represented using a color scale.
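A simplified version of this calculation can be written directly over the classified point cloud: rasterise the ground points into a DTM, rasterise the highest returns into a DSM, and subtract. The sketch below uses plain numpy gridding instead of the lidR interpolation functions, so it approximates the idea rather than reproducing the actual workflow; the cell size is an assumption.

```python
import numpy as np

def canopy_height_model(xyz: np.ndarray, classification: np.ndarray,
                        cell: float = 1.0) -> np.ndarray:
    """Compute a simple CHM raster (DSM - DTM) from classified points."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    ix = ((x - x.min()) / cell).astype(int)
    iy = ((y - y.min()) / cell).astype(int)
    shape = (ix.max() + 1, iy.max() + 1)

    dsm = np.full(shape, np.nan)  # highest return per cell
    dtm = np.full(shape, np.nan)  # lowest ground return per cell
    for i, j, zi, cls in zip(ix, iy, z, classification):
        dsm[i, j] = zi if np.isnan(dsm[i, j]) else max(dsm[i, j], zi)
        if cls == 2:  # ground points only
            dtm[i, j] = zi if np.isnan(dtm[i, j]) else min(dtm[i, j], zi)

    # canopy height above terrain; NaN where a cell has no ground point
    return dsm - dtm
```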
Comparison of results
To compare the results of the approaches described above, we focused on the Petržalka area in Bratislava, where manual measurements of tree positions and heights had already been conducted. From the entire area (approximately 3500x3500 m), we selected a representative smaller area of 300x300 m (Fig. 2). We obtained results for the TreeIso plugin module in CloudCompare (CC), running on a PC in a Windows environment, and for the locate_trees() and segment_trees() algorithms of the lidR library in the HPC environment of the Devana supercomputer. We evaluated the tree positions qualitatively and quantitatively using the Munkres (Hungarian) algorithm [10] for optimal matching. The Munkres algorithm is an efficient method for finding an optimal matching in bipartite graphs; applied here, it finds the best correspondence between trees identified from the LiDAR data and their manually determined positions. By setting an appropriate distance threshold in meters (e.g., 5 m), we can determine the number of accurately identified tree positions. The results are presented as histograms and as the percentage accuracy of tree positions depending on the chosen threshold (Fig. 8).
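The matching step can be reproduced with SciPy's implementation of the Hungarian algorithm; the sketch below pairs detected tree positions with reference positions and counts the matches that fall within a chosen distance threshold (5 m in the evaluation above). The function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_accuracy(detected: np.ndarray, reference: np.ndarray,
                   threshold: float = 5.0) -> float:
    """Fraction of reference trees matched to a detection within `threshold` metres.

    Both inputs are (n, 2) arrays of x, y positions in metres.
    """
    cost = cdist(detected, reference)          # pairwise distances in metres
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    matched = np.sum(cost[rows, cols] <= threshold)
    return matched / len(reference)

# Hypothetical usage:
# acc = match_accuracy(detected_xy, manual_xy, threshold=5.0)
```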
We found that both methods achieve almost the same result at a 5-meter distance threshold, approximately 70% accurate tree positions. The method used in CloudCompare shows better results, i.e., a higher percentage at lower threshold values, as reflected in the corresponding histograms (Fig. 8). When comparing both methods, we achieve up to approximately 85% agreement at a threshold of up to 5 meters, indicating the qualitative parity of both approaches. The quality of the results is mainly influenced by the accuracy of vegetation classification in point clouds, as the presence of various artifacts incorrectly classified as vegetation distorts the results. Tree segmentation algorithms cannot eliminate the impact of these artifacts.
Figure 8. The histograms on the left display the number of correctly identified trees depending on the chosen distance threshold in meters (top: CC – CloudCompare - method, bottom: lidR method). The graphs on the right show the percentage success rate of correctly identified tree positions based on the method used and the chosen distance threshold in meters.
Parallel efficiency analysis of the locate_trees() algorithm in the lidR library
To determine the efficiency of parallelizing the locate_trees() algorithm in the lidR library, we applied the algorithm to the same study area using different numbers of CPU cores – 1, 2, 4, and so on up to 64 (the maximum available on a compute node of the Devana HPC system). To assess sensitivity to problem size, we tested it on three areas of different sizes – 300x300, 1000x1000, and 3500x3500 meters. The measured times are shown in Table 1, and the scalability of the algorithm is illustrated in Figure 9. The results show that the scalability of the algorithm is not ideal: when using approximately 20 CPU cores, the algorithm's efficiency drops to about 50%, and with 64 CPU cores the efficiency is only 15-20%. The efficiency is also affected by the problem size – the larger the area, the lower the efficiency, although this effect is not as pronounced. In conclusion, for effective use of the algorithm it is advisable to use 16-32 CPU cores and to maximize hardware utilization by appropriately dividing the study area into smaller parts. Using more than 32 CPU cores is not efficient but still allows for further acceleration of the computation.
Figure 9. Speedup of the lmf algorithm in the locate_trees() function of the lidR library as a function of the number of CPU cores (NCPU) and the size of the studied area (in meters).
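For reference, the speedup and parallel-efficiency figures quoted above follow the usual definitions, as in the short sketch below; the numbers in the example are illustrative, not the measured times from Table 1.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup relative to the single-core run time."""
    return t1 / tn

def efficiency(t1: float, tn: float, n_cores: int) -> float:
    """Parallel efficiency: achieved speedup divided by the ideal speedup."""
    return speedup(t1, tn) / n_cores

# Illustrative numbers: ~20% efficiency at 64 cores corresponds to a speedup of only ~13x
print(efficiency(t1=13000.0, tn=1000.0, n_cores=64))  # ~0.20
```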
Final evaluation
We found that achieving good results requires carefully setting the parameters of the algorithms used, as the number and quality of the resulting tree positions depend heavily on these settings. If obtaining the most accurate results is the goal, a possible strategy would be to select a representative part of the study area, manually determine the tree positions, and then adjust the parameters of the respective algorithms. These optimized settings can then be used for the analysis of the entire study area.
The quality of the results is also influenced by various other factors, such as the season, which affects vegetation density, the density of trees in the area, and the species diversity of the vegetation. The quality of the results is further impacted by the quality of vegetation classification in the point cloud, as the presence of various artifacts, such as parts of buildings, roads, vehicles, and other objects, can negatively affect the results. The tree segmentation algorithms cannot always reliably filter out these artifacts.
Regarding computational efficiency, we can conclude that using an HPC environment provides a significant opportunity for accelerating the evaluation process. For illustration, processing the entire study area of Petržalka (3500x3500 m) on a single compute node of the Devana HPC system took approximately 820 seconds, utilizing all 64 CPU cores. Processing the same area in CloudCompare on a powerful PC using a single CPU core took approximately 6200 seconds, which is about 8 times slower.
Authors: Marián Gall – Slovak National Supercomputing Centre; Michal Malček – Slovak National Supercomputing Centre; Lucia Demovičová – Centrum spoločných činností SAV v. v. i., organizačná zložka Výpočtové stredisko; Dávid Murín – SKYMOVE s. r. o.; Robert Straka – SKYMOVE s. r. o.
References:
[6] Roussel J., Auty D. (2024). Airborne LiDAR Data Manipulation and Visualization for Forestry Applications.
[7] Popescu, Sorin & Wynne, Randolph. (2004). Seeing the Trees in the Forest: Using Lidar and Multispectral Data Fusion with Local Filtering and Variable Window Size for Estimating Tree Height. Photogrammetric Engineering and Remote Sensing. 70. 589-604. 10.14358/PERS.70.5.589.
[8] Silva C. A., Hudak A. T., Vierling L. A., Loudermilk E. L., Brien J. J., Hiers J. K., Khosravipour A. (2016). Imputation of Individual Longleaf Pine (Pinus palustris Mill.) Tree Attributes from Field and LiDAR Data. Canadian Journal of Remote Sensing, 42(5).
[9] Ester M., Kriegel H. P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96 Proceedings (1996), pp. 226–231.
[10] Kuhn H. W., “The Hungarian Method for the assignment problem”, Naval Research Logistics Quarterly, 2: 83–97, 1955