{"id":7416,"date":"2023-12-22T11:15:25","date_gmt":"2023-12-22T10:15:25","guid":{"rendered":"https:\/\/eurocc.nscc.sk\/?p=7416"},"modified":"2024-02-15T13:40:40","modified_gmt":"2024-02-15T12:40:40","slug":"identifikacia-entit-pre-extrakciu-adries-z-transkriptovanych-rozhovorov-s-vyuzitim-syntetickych-dat","status":"publish","type":"post","link":"https:\/\/eurocc.nscc.sk\/en\/identifikacia-entit-pre-extrakciu-adries-z-transkriptovanych-rozhovorov-s-vyuzitim-syntetickych-dat\/","title":{"rendered":"Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data"},"content":{"rendered":"<div class=\"is-layout-flow wp-block-group alignfull posts-all\"><div class=\"wp-block-group__inner-container\">\n<div class=\"is-layout-flex wp-container-4 wp-block-columns\">\n<div class=\"is-layout-flow wp-block-column\" style=\"flex-basis:60%\">\n<div class=\"is-layout-flow wp-block-group alignfull\"><div class=\"wp-block-group__inner-container\">\n<p><strong>Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data<\/strong><\/p>\n\n\n\n<p><\/p>\n<\/div><\/div>\n\n\n\n<p>Many businesses spend large amounts of resources for communicating with clients. Usually, the goal is\nto provide clients with information, but sometimes there is also a need to request specific information\nfrom them.\nIn addressing this need, there has been a significant effort put into the development of chatbots\nand voicebots, which on one hand serve the purpose of providing information to clients, but they can\nalso be utilized to contact a client with a request to provide some information.\nA specific real-world example is to contact a client, via text or via phone, to update their postal address. The address may have possibly changed over time, so a business needs to update this information\nin its internal client database. 
<\/p>\n\n\n\n<p> <\/p>\n<\/div>\n\n\n\n<div class=\"is-layout-flow wp-block-column\">\n<figure class=\"wp-block-image alignwide size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"812\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-1024x812.png\" alt=\"\" class=\"wp-image-7435\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-1024x812.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-300x238.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-768x609.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-1536x1218.png 1536w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-2048x1624.png 2048w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-15x12.png 15w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-1200x952.png 1200w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/Picture_4-1980x1570.png 1980w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Illustrative image<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>Nonetheless, when requesting such information through novel channels, such as chatbots or voicebots, it is important to verify the validity and format of the address. In such cases, the address information usually comes as free-form text input or as a speech-to-text transcription. Such inputs may contain substantial noise or variations in the address format. It is therefore necessary to filter out the noise and extract the entities which constitute the actual address. This process of extracting entities from an input text is known as Named Entity Recognition (NER). In our particular case we deal with the following entities: municipality name, street name, house number, and postal code. 
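To make the task concrete, here is an illustrative input/output pair: a noisy transcription mapped to the structured fields the NER system should recover. The sentence and entity values below are invented stand-ins, not taken from the project's data, and (as the report notes for its real examples) not every utterance contains all four entity types.

```python
# Hypothetical example of the extraction task: the transcription and the
# target values are invented for illustration only.
transcription = "no dobre moja adresa je bauerova 7 kosice"

# What a successful NER pass should recover from such noisy input
# (this utterance happens to contain no postal code):
target = {
    "Street": "bauerova",
    "Housenumber": "7",
    "Municipality": "kosice",
}
```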
This technical report describes the development and evaluation of a NER system for extracting such information.<\/p>\n\n\n\n<p><strong>Problem Description and Our Approach<\/strong><\/p>\n\n\n\n<p>This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and nettle, s.r.o., a Slovak-based start-up focusing on natural language processing, chatbots, and voicebots. Our goal is to develop a highly accurate and reliable NER model for address parsing. The model accepts both free text and speech-to-text transcribed text. Our NER model constitutes an important building block in real-world customer care systems and can be employed in various scenarios where address extraction is relevant. <br><br>The challenging aspect of this task was that the data was available exclusively in Slovak, which severely limited our choice of baseline models. Several NER models for the Slovak language are currently publicly available, all based on the general-purpose pre-trained model SlovakBERT [1]. Unfortunately, these models support only a few entity types, and support for the entities relevant to address extraction is missing. A straightforward utilization of popular Large Language Models (LLMs) like GPT is not an option in our use cases because of data privacy concerns and the time delays caused by calls to these rather slow LLM APIs.<\/p>\n\n\n\n<p>We propose fine-tuning SlovakBERT for NER. In our case, NER is a classification task at the token level. We aim to achieve proficiency in address entity recognition with only a tiny number of real-world examples available. In Section 2.1 we describe our dataset and the data creation process. The significant lack of available real-world data prompted us to generate synthetic data to cope with data scarcity. In Section 2.2 we propose modifications of SlovakBERT in order to train it for our task. 
In Section 2.3 we explore iterative improvements in our data generation approach. Finally, we present model performance results in Section 3.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"has-normal-font-size\">Data<\/h2>\n\n\n\n<p>The aim of the task is to recognize street names, house numbers, municipality names, and postal codes in spoken sentences transcribed via speech-to-text. Only 69 instances of real-world collected data were available. Furthermore, all of those instances were highly affected by noise, e.g., natural speech hesitations and speech transcription glitches. Therefore, we use this data exclusively for testing. Table 1 shows two examples from the collected dataset.<\/p>\n\n\n\n<p><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"953\" height=\"296\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1.png\" alt=\"\" class=\"wp-image-7420\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1.png 1254w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1-300x106.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1-1024x363.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1-768x272.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1-18x6.png 18w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/nettle1-1200x425.png 1200w\" sizes=\"(max-width: 953px) 100vw, 953px\" \/><\/a><figcaption class=\"wp-element-caption\">Table 1: Two example instances from our collected real-world dataset. The Sentence column showcases the original address text. The Tokenized text column contains the tokenized sentence representation, and the Tags column contains tags for the corresponding tokens. Note that not every instance necessarily contains all considered entity types. 
Some instances contain noise, while others have grammar\/spelling mistakes: the token \"Dalsie\" is not part of an address, and the street name \"bauerova\" is not capitalized.<\/figcaption><\/figure><\/div>\n\n\n<p>       Artificial generation of a training dataset emerged as the only viable option to tackle the problem of data shortage. Inspired by the 69 real instances, we programmatically made numerous external API calls to OpenAI to generate similar realistic-looking examples. The BIO annotation scheme [2] was used to label the dataset. This scheme is a method used in NLP to annotate tokens in a sequence as the beginning (B), inside (I), or outside (O) of entities. We use 9 annotations: O, B-Street, I-Street, B-Housenumber, I-Housenumber, B-Municipality, I-Municipality, B-Postcode, I-Postcode. <br><br>We generated data in multiple iterations as described below in Section 2.3. Our final training dataset consisted of more than 10,000 sentences\/address examples. For data generation we used the GPT-3.5-turbo API along with some prompt engineering. Since data generation through this API is limited by the number of tokens (both prompt and generated tokens), we could not pass the list of all possible Slovak street and municipality names within the prompt. Hence, data was generated with the placeholders streetname and municipalityname, which were subsequently replaced by street and municipality names chosen at random from the respective lists. A complete list of Slovak street and municipality names was obtained from the web pages of the Ministry of Interior of the Slovak Republic [3].<br><br>Using the OpenAI API for generation allowed us to obtain organic sentences without manually creating the data, which sped up the process significantly. However, this approach did not come without downsides. 
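The placeholder substitution and BIO labeling described above can be sketched in a few lines. This is a toy illustration, not the project's pipeline: the street and municipality lists are tiny stand-ins for the official register, and the template and gold spans are invented.

```python
import random

# Stand-ins for the full register of Slovak street and municipality names.
STREETS = ["Bauerova", "Hlavna", "Sturova"]
MUNICIPALITIES = ["Kosice", "Nitra", "Trnava"]

def fill_template(template, rng):
    """Replace the generation placeholders with randomly chosen real names."""
    return (template
            .replace("streetname", rng.choice(STREETS))
            .replace("municipalityname", rng.choice(MUNICIPALITIES)))

def bio_tags(tokens, spans):
    """BIO-tag `tokens`; `spans` maps (start, end) token ranges
    (end exclusive) to an entity type such as "Street"."""
    tags = ["O"] * len(tokens)
    for (start, end), entity in spans.items():
        tags[start] = "B-" + entity
        tags[start + 1:end] = ["I-" + entity] * (end - start - 1)
    return tags

rng = random.Random(0)
sentence = fill_template("byvam na streetname 12 municipalityname", rng)
tokens = sentence.split()
# Gold spans for this template: token 2 = street, 3 = house number, 4 = municipality.
tags = bio_tags(tokens, {(2, 3): "Street", (3, 4): "Housenumber", (4, 5): "Municipality"})
# tags == ["O", "O", "B-Street", "B-Housenumber", "B-Municipality"]
```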
The generated dataset contained many mistakes, mainly wrong annotations, which had to be corrected manually. The generated dataset was split so that 80% was used for the model\u2019s training, 15% for validation, and 5% served as synthetic test data, allowing us to compare the performance of the model on real as well as artificial test data.<\/p>\n\n\n\n<p><strong>Model Development and Training<\/strong> <\/p>\n\n\n\n<p>Two general-purpose pre-trained models were utilized and compared: SlovakBERT [1] and a distilled version of this model [4]. Herein we refer to the distilled version as DistilSlovakBERT. SlovakBERT is an open-source model pre-trained on the Slovak language using a Masked Language Modeling (MLM) objective. It was trained on a general Slovak web-based corpus, but it can be easily adapted to new domains and tasks [1]. DistilSlovakBERT is a pre-trained model obtained from SlovakBERT by a method called knowledge distillation, which significantly reduces the size of the model while retaining 97% of its language understanding capabilities.<br><br>We modified both models by adding a token classification layer, obtaining in both cases a model suitable for NER tasks. The final classification layer consists of 9 neurons corresponding to the 9 entity annotations: each of the 4 address parts is represented by two annotations \u2013 the beginning and the inside of the entity \u2013 and one neuron represents the absence of any entity. 
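The token-classification head described above can be attached with the Hugging Face transformers library roughly as follows. This is a hedged sketch: the checkpoint name "gerulata/slovakbert" is the publicly released SlovakBERT, but the exact configuration used in the project is not shown in the report.

```python
# Label set from the report: 4 address parts x (B-, I-) plus O.
LABELS = ["O",
          "B-Street", "I-Street",
          "B-Housenumber", "I-Housenumber",
          "B-Municipality", "I-Municipality",
          "B-Postcode", "I-Postcode"]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

def build_model():
    # Deferred import/download: only runs when a model is actually needed.
    from transformers import AutoModelForTokenClassification
    return AutoModelForTokenClassification.from_pretrained(
        "gerulata/slovakbert",   # base checkpoint; swap in the distilled variant to compare
        num_labels=len(LABELS),  # adds a fresh 9-way classification layer on top
        id2label=id2label,
        label2id=label2id,
    )
```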
The number of parameters for each model and its components are summarized in Table 2.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/222222222222.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"230\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2.png\" alt=\"\" class=\"wp-image-7452\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2.png 1172w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2-300x92.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2-1024x314.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2-768x235.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net2-18x6.png 18w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Table 2: The number of parameters in our two NER models and their respective counts for the base model and the classification head.<\/figcaption><\/figure><\/div>\n\n\n<p>The models\u2019 training was highly susceptible to overfitting. To tackle this and further enhance the training process, we used a linear learning rate scheduler, weight decay, and several other hyperparameter tuning strategies.<br><br>Computing resources of the HPC system Devana, operated by the Computing Centre, Centre of Operations of the Slovak Academy of Sciences, were leveraged for model training, specifically a GPU node with one NVIDIA A100 GPU. For more convenient data analysis and debugging, an interactive environment using OpenOnDemand was employed, which gives researchers remote web access to supercomputers.<br><br>The training process required only 10-20 epochs to converge for both models. 
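A training configuration consistent with the setup described above (linear learning rate decay, weight decay, convergence within 10-20 epochs) might look like the dictionary below, which could be passed to transformers.TrainingArguments. The specific values are illustrative assumptions, not the report's tuned settings.

```python
# Illustrative hyperparameters only; the report states a linear LR scheduler,
# weight decay, and 10-20 epochs, but not the exact values used.
TRAIN_CONFIG = {
    "num_train_epochs": 20,            # upper end of the observed 10-20 epoch range
    "lr_scheduler_type": "linear",     # linear learning rate decay, as in the report
    "learning_rate": 2e-5,             # assumed; a common fine-tuning default
    "weight_decay": 0.01,              # assumed; regularization against overfitting
    "per_device_train_batch_size": 32, # assumed; fits easily on one A100
}
```

Usage would be `TrainingArguments(output_dir="address-ner", **TRAIN_CONFIG)` together with a `Trainer` over the synthetic dataset, monitoring validation loss for the overfitting the report mentions.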
Using the described HPC setting, one epoch (9492 training samples) took on average 20 seconds for SlovakBERT and 12 seconds for DistilSlovakBERT. Inference on 69 samples takes 0.64 seconds for SlovakBERT and 0.37 seconds for DistilSlovakBERT, which demonstrates the models\u2019 efficiency in real-time NLP pipelines.<\/p>\n\n\n\n<p><strong>Iterative Improvements<\/strong><\/p>\n\n\n\n<p>Although 69 instances of real data were available, their complexity was quite challenging to imitate in generated data. The generated dataset was created using several different prompts, resulting in 11,306 sentences that resembled human-generated content. The work consisted of a number of iterations, each of which can be split into the following steps: generate data, train a model, visualize the obtained prediction errors on the real and artificial test datasets, and analyze them. This way we identified patterns that the model failed to recognize. Based on these insights we generated new data that followed the newly identified patterns. The patterns we devised in the various iterations are presented in Table 3. Both of our models were trained on each newly expanded dataset, with SlovakBERT\u2019s accuracy always exceeding that of DistilSlovakBERT. Therefore, we decided to further utilize only SlovakBERT as the base model.<\/p>\n\n\n\n<p><strong>Results<\/strong><\/p>\n\n\n\n<p>The confusion matrix corresponding to the results obtained with the model trained in Iteration 1 (see Table 3) is displayed in Table 4. This model was able to correctly recognize only 67.51% of entities in the test dataset. Granular examination of the errors revealed that the training dataset did not represent the real-world sentences well enough, and more and better representative data needed to be generated. Table 4 shows that the most common error was identifying a municipality as a street. 
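The error analysis behind the confusion matrices can be sketched as a count over (gold, predicted) entity-type pairs. The example tag sequences below are hypothetical; they merely reproduce the municipality-predicted-as-street pattern discussed above.

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold, predicted) entity-type pairs, ignoring B-/I- prefixes."""
    strip = lambda tag: tag.split("-")[-1]  # "B-Street" -> "Street", "O" -> "O"
    return Counter((strip(g), strip(p)) for g, p in zip(gold, pred))

# Hypothetical gold vs. predicted tags for one test sentence.
gold = ["B-Municipality", "I-Municipality", "B-Street", "O"]
pred = ["B-Street",       "I-Street",       "B-Street", "O"]
errors = confusion_counts(gold, pred)
# errors[("Municipality", "Street")] == 2: the municipality-before-street
# confusion pattern that motivated the later data-generation iterations.
```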
We noticed that this occurred when the municipality name appeared before the street name in the address. This insight drove the data generation in Iterations 2 and 3.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/333333333333333333.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"382\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3.png\" alt=\"\" class=\"wp-image-7464\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3.png 1445w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3-300x104.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3-1024x354.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3-768x266.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3-18x6.png 18w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net3-1200x415.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Table 3: The iterative improvements of data generation. Each prompt was used twice: first with and then without noise, i.e., natural human speech hesitations. Where indicated, the prompt allowed shuffling or omitting some address parts.<\/figcaption><\/figure><\/div>\n\n\n<p>This process of detailed analysis of prediction errors and subsequent data generation accounts for most of the improvements in the accuracy of our model. The goal was to achieve more than 90% accuracy on the test data. The model\u2019s predictive accuracy kept increasing with systematic data generation. Eventually, the whole dataset was duplicated, with the duplicate copies differing only in letter case. (The utilized pre-trained model is case sensitive, and some test instances contained street and municipality names in lowercase.) 
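The case-duplication step described above can be sketched as a simple augmentation pass. This is one plausible reading of the report (each example duplicated with its tokens lowercased, tags unchanged); the actual implementation is not shown.

```python
def augment_case(dataset):
    """Duplicate each (tokens, tags) example with lowercased tokens.

    The BIO tags are unchanged: letter case does not affect which entity
    a token belongs to, only how the case-sensitive model sees it.
    """
    out = list(dataset)
    for tokens, tags in dataset:
        out.append(([t.lower() for t in tokens], tags))
    return out

# Hypothetical single-example dataset.
data = [(["Bauerova", "12", "Kosice"],
         ["B-Street", "B-Housenumber", "B-Municipality"])]
augmented = augment_case(data)
# augmented now holds the original example plus its lowercased copy.
```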
This made the model more robust to the form in which it receives input and led to a final accuracy of 93.06%. The confusion matrix of the final model can be seen in Table 5.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/44444444444c.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"523\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX.png\" alt=\"\" class=\"wp-image-7467\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX.png 1147w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX-300x201.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX-1024x686.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX-768x514.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/netX-18x12.png 18w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Table 4: Confusion matrix of the model trained on the dataset from the first iteration, reaching a predictive accuracy of 67.51%.<\/figcaption><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/555555555555555555555555.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"522\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5.png\" alt=\"\" class=\"wp-image-7477\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5.png 1161w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5-300x191.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5-1024x654.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5-768x490.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/net5-18x12.png 18w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption 
class=\"wp-element-caption\">Table 5: Confusion matrix of the nal model with the predictive accuracy of 93.06%. Comparing the\nresults to the results in Table 4, we can see that the accuracy increased by 25.55%.<\/figcaption><\/figure><\/div>\n\n\n<p>There are still some errors; notably, tokens that should have been tagged as outside were occasionally misclassified as municipality. We have opted not to tackle this issue further, as it happens\non words that may resemble subparts of our entity names, but, in reality, do not represent entities\nthemselves. See an example below in Table 6.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/666666666666666666666z.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"392\" src=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6.png\" alt=\"\" class=\"wp-image-7473\" srcset=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6.png 1345w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6-300x126.png 300w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6-1024x432.png 1024w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6-768x324.png 768w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6-18x8.png 18w, https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2023\/12\/N6-1200x506.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Table 6: Examples of the nal model's predictions for two test sentences. The rst sentence contains\none incorrectly classied token: the third token \\Kal\" with ground truth label O was predicted as\nB-Municipality. 
The misclassification of \"Kal\" as a municipality occurred due to its similarity to subwords found in \"Kalsa\"; the ground truth labeling was based on context and the authors' judgment. The second sentence has all its tokens classified correctly.<\/figcaption><\/figure><\/div>\n\n\n<p><strong>Conclusions<\/strong><\/p>\n\n\n\n<p>In this technical report we trained a NER model built upon the pre-trained SlovakBERT language model as the base. The model was trained and validated exclusively on an artificially generated dataset. This representative, high-quality synthetic data was iteratively expanded. Together with hyperparameter fine-tuning, this iterative approach allowed us to reach a predictive accuracy on the real dataset exceeding 90%. Since the real dataset contained a mere 69 instances, we decided to use it only for testing. Despite the limited amount of real data, our model exhibits promising performance. This approach underscores the potential of using an exclusively synthetic dataset, especially in cases where the amount of real data is not sufficient for training.<br><br>This model can be utilized in real-world applications within NLP pipelines to extract and verify the correctness of addresses transcribed by speech-to-text mechanisms. In case a larger real-world dataset becomes available, we recommend retraining the model and possibly also expanding the synthetic dataset with more generated data, as the existing dataset might not represent newly occurring data patterns. 
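Integrating the released checkpoint into such a pipeline could look like the sketch below, using the standard transformers token-classification pipeline with the model id from the link in this report. The first call downloads the model, so this is deferred into a function; aggregation merges B-/I- word pieces into whole entity spans.

```python
def extract_address(text):
    """Run the published address-NER checkpoint on one transcription.

    Returns (entity_type, text_span) pairs, e.g. ("Street", "Bauerova").
    Requires the transformers library and network access on first use.
    """
    from transformers import pipeline
    ner = pipeline(
        "token-classification",
        model="nettle-ai/slovakbert-address-ner",  # model id from this report
        aggregation_strategy="simple",  # merge B-/I- pieces into entity spans
    )
    return [(ent["entity_group"], ent["word"]) for ent in ner(text)]
```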
<br>The model is available at <a href=\"https:\/\/huggingface.co\/nettle-ai\/slovakbert-address-ner\">https:\/\/huggingface.co\/nettle-ai\/slovakbert-address-ner<\/a><\/p>\n\n\n\n<p><strong>Acknowledgement<\/strong><\/p>\n\n\n\n<p>The research results were obtained with the support of the Slovak National Competence Centre for HPC, the EuroCC 2 project and the Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.<\/p>\n\n\n\n<p><strong>AUTHORS <\/strong><\/p>\n\n\n\n<p>Bibi\u00e1na Laj\u010dinov\u00e1 \u2013 Slovak National Supercomputing Centre<\/p>\n\n\n\n<p>Patrik Val\u00e1bek \u2013 Slovak National Supercomputing Centre; Institute of Information Engineering, Automation, and Mathematics, Slovak University of Technology in Bratislava<\/p>\n\n\n\n<p>Michal Spi\u0161iak &#8211; nettle, s. r. o.<\/p>\n<\/div><\/div>\n\n\n\n<div class=\"is-layout-flow wp-block-group alignfull posts-all\"><div class=\"wp-block-group__inner-container\">\n<p><a href=\"https:\/\/eurocc.nscc.sk\/wp-content\/uploads\/2024\/02\/Nettle_phase1_SK.pdf\">Full version of the article SK<\/a><br><a href=\"https:\/\/arxiv.org\/pdf\/2402.05545.pdf\">Full version of the article EN<\/a><\/p>\n\n\n\n<p><strong>References:<\/strong><\/p>\n\n\n\n<p>[1] Matu\u0301s Pikuliak, Stefan Grivalsky, Martin Konopka, Miroslav Blsta\u0301k, Martin Tamajka, Viktor Bachraty\u0301, Maria\u0301n Simko, Pavol Bala\u0301zik, Michal Trnka, and Filip Uhla\u0301rik. SlovakBERT: Slovak masked language model. CoRR, abs\/2109.15254, 2021.<br><br>[2] Lance Ramshaw and Mitch Marcus. Text chunking using transformation-based learning. 
In Third Workshop on Very Large Corpora, 1995.<br><\/p>\n\n\n\n<p>[3] Ministerstvo vnu\u0301tra Slovenskej republiky. Register adries. https:\/\/data.gov.sk\/dataset\/register-adries-register-ulic. Accessed: August 21, 2023.<br><br>[4] Ivan Agarsky\u0301. Hugging Face model hub. https:\/\/huggingface.co\/crabz\/distil-slovakbert, 2022. Accessed: September 15, 2023.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><br><\/p>\n\n\n\n<div class=\"is-horizontal is-content-justification-center is-layout-flex wp-container-6 wp-block-buttons\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"\/en\/success-stories\/\">Success-Stories<\/a><\/div>\n<\/div>\n<\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Many businesses spend large amounts of resources for communicating with clients. Usually, the goal is\nto provide clients with information, but sometimes there is also a need to request specific information\nfrom them.\nIn addressing this need, there has been a significant effort put into the development of chatbots\nand voicebots, which on one hand serve the purpose of providing information to clients, but they can\nalso be utilized to contact a client with a request to provide some information.\nA specific real-world example is to contact a client, via text or via phone, to update their postal address. The address may have possibly changed over time, so a business needs to update this information\nin its internal client database. 
<\/p>","protected":false},"author":2,"featured_media":7435,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"templates\/template-full-width.php","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/posts\/7416"}],"collection":[{"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/comments?post=7416"}],"version-history":[{"count":82,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/posts\/7416\/revisions"}],"predecessor-version":[{"id":8478,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/posts\/7416\/revisions\/8478"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/media\/7435"}],"wp:attachment":[{"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/media?parent=7416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/categories?post=7416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/eurocc.nscc.sk\/en\/wp-json\/wp\/v2\/tags?post=7416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}