
Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data


Many businesses spend substantial resources on communicating with their clients. Usually the goal is to provide clients with information, but sometimes specific information also needs to be requested from them. To address this need, significant effort has been put into the development of chatbots and voicebots, which serve to provide information to clients but can also be used to contact a client with a request for information. A specific real-world example is contacting a client, via text or phone, to update their postal address. The address may have changed over time, so the business needs to update this information in its internal client database.


Nonetheless, when requesting such information through novel channels such as chatbots or voicebots, it is important to verify the validity and format of the address. In such cases, the address usually arrives as free-form text input or as a speech-to-text transcription. Such inputs may contain substantial noise or variations in the address format. It is therefore necessary to filter out the noise and extract the entities that constitute the actual address. This process of extracting entities from an input text is known as Named Entity Recognition (NER). In our particular case we deal with the following entities: municipality name, street name, house number, and postal code. This technical report describes the development and evaluation of a NER system for extracting such information.

Problem Description and Our Approach

This work is a joint effort of the Slovak National Competence Center for High-Performance Computing and nettle, s.r.o., a Slovak-based start-up focusing on natural language processing, chatbots, and voicebots. Our goal is to develop a highly accurate and reliable NER model for address parsing. The model accepts both free-form text and speech-to-text transcriptions. Our NER model constitutes an important building block in real-world customer care systems and can be employed in various scenarios where address extraction is relevant.

The challenging aspect of this task was that the data was available exclusively in Slovak, which greatly limits our choice of a baseline model. There are currently several publicly available NER models for the Slovak language, based on the general-purpose pre-trained model SlovakBERT [1]. Unfortunately, all of these models support only a few entity types, and support for the entities relevant to address extraction is missing. A straightforward use of popular Large Language Models (LLMs) such as GPT is not an option in our use cases because of data privacy concerns and the latency of calls to these rather time-consuming LLM APIs.

We propose fine-tuning SlovakBERT for NER. In our case, NER is a classification task at the token level. We aim to achieve proficiency in address entity recognition with only a tiny number of real-world examples available. In Section 2.1 we describe our dataset and the data creation process. The significant lack of available real-world data prompted us to generate synthetic data to cope with data scarcity. In Section 2.2 we propose modifications of SlovakBERT in order to train it for our task. In Section 2.3 we explore iterative improvements in our data generation approach. Finally, we present model performance results in Section 3.

Data

The aim of the task is to recognize street names, house numbers, municipality names, and postal codes from the spoken sentences transcribed via speech-to-text. Only 69 instances of real-world collected data were available. Furthermore, all of those instances were highly affected by noise, e.g., natural speech hesitations and speech transcription glitches. Therefore, we use this data exclusively for testing. Table 1 shows two examples from the collected dataset.

Table 1: Two example instances from our collected real-world dataset. The Sentence column showcases the original address text. The Tokenized text column contains the tokenized sentence representation, and the Tags column contains tags for the corresponding tokens. Note that not every instance necessarily contains all considered entity types. Some instances contain noise, while others have grammar/spelling mistakes: the token "Dalsie" is not part of an address and the street name "bauerova" is not capitalized.

Artificial generation of a training dataset emerged as the only viable option to tackle the data shortage. Inspired by the 69 real instances, we programmatically made numerous calls to the external OpenAI API to generate similar realistic-looking examples. The BIO annotation scheme [2] was used to label the dataset. This scheme is a method used in NLP to annotate tokens in a sequence as the beginning (B), inside (I), or outside (O) of entities. We use 9 annotations: O, B-Street, I-Street, B-Housenumber, I-Housenumber, B-Municipality, I-Municipality, B-Postcode, I-Postcode.
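
To make the annotation scheme concrete, the sketch below shows how a single tokenized sentence would be labelled with these 9 tags. The example sentence is hypothetical and not taken from the dataset.

```python
# Illustrative (hypothetical) training example in the BIO scheme described above:
# each token is paired with one of the 9 labels.
example = [
    ("Bývam",      "O"),
    ("na",         "O"),
    ("Mierovej",   "B-Street"),
    ("7",          "B-Housenumber"),
    ("821",        "B-Postcode"),
    ("05",         "I-Postcode"),
    ("Bratislava", "B-Municipality"),
]

LABELS = ["O", "B-Street", "I-Street", "B-Housenumber", "I-Housenumber",
          "B-Municipality", "I-Municipality", "B-Postcode", "I-Postcode"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```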

We generated data in multiple iterations, as described below in Section 2.3. Our final training dataset consisted of more than 10⁴ sentences/address examples. For data generation we used the GPT-3.5-turbo API along with some prompt engineering. Since data generation through this API is limited by the number of tokens (both generated and prompt tokens), we could not pass the list of all possible Slovak street and municipality names within the prompt. Hence, data was generated with the placeholders streetname and municipalityname, which were subsequently replaced by street and municipality names chosen randomly from the respective lists. A complete list of Slovak street and municipality names was obtained from the web pages of the Ministry of Interior of the Slovak Republic [3].
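
The placeholder-substitution step could look roughly like the following sketch. The file names, placeholder tokens and helper function are assumptions for illustration; the actual generation pipeline is not published here.

```python
import random

def load_names(path):
    """Load one name per line (street or municipality register export)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Assumed file names for the lists obtained from the Ministry of Interior register.
streets = load_names("streets.txt")
municipalities = load_names("municipalities.txt")

def fill_placeholders(template: str) -> str:
    """Replace placeholders in a GPT-generated template with randomly chosen names."""
    return (template
            .replace("streetname", random.choice(streets))
            .replace("municipalityname", random.choice(municipalities)))

print(fill_placeholders("Bývam na streetname 12, municipalityname."))
```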

Using the OpenAI API generative algorithm, we were able to obtain natural-sounding sentences without generating the data manually, which sped up the process significantly. However, this approach did not come without downsides. The generated dataset contained many mistakes, mainly wrong annotations, which had to be corrected manually. The generated dataset was split so that 80% was used for training, 15% for validation, and 5% as synthetic test data, allowing us to compare the performance of the model on real test data as well as on artificial test data.

Model Development and Training

Two general-purpose pre-trained models were utilized and compared: SlovakBERT [1] and a distilled version of this model [4], which we refer to as DistilSlovakBERT. SlovakBERT is an open-source model pre-trained on Slovak language data using a Masked Language Modeling (MLM) objective. It was trained on a general Slovak web-based corpus, but it can easily be adapted to new domains and tasks [1]. DistilSlovakBERT is a pre-trained model obtained from SlovakBERT by a method called knowledge distillation, which significantly reduces the size of the model while retaining 97% of its language understanding capabilities.

We modified both models by adding a token classification layer, obtaining in both cases a model suitable for NER. The final classification layer consists of 9 neurons corresponding to the 9 entity annotations: each of the 4 address parts is represented by two annotations (beginning and inside of the entity), plus one annotation for the absence of any entity. The number of parameters for each model and its components are summarized in Table 2.

Table 2: The number of parameters in our two NER models and their respective counts for the base model and the classification head.
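
A minimal sketch of this modification using the Hugging Face transformers API is shown below. The checkpoint name refers to the publicly released SlovakBERT model and is an assumption; the report's actual training code is not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-Street", "I-Street", "B-Housenumber", "I-Housenumber",
          "B-Municipality", "I-Municipality", "B-Postcode", "I-Postcode"]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

MODEL_NAME = "gerulata/slovakbert"   # assumed public SlovakBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),          # 9-neuron token-classification head
    id2label=id2label,
    label2id=label2id,
)
```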

Training of the models was highly susceptible to overfitting. To tackle this and further improve the training process, we used a linear learning rate scheduler, weight decay, and other hyperparameter tuning strategies.
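
This regularized set-up could be expressed with transformers TrainingArguments roughly as follows; the concrete hyperparameter values are illustrative assumptions, not the values used in the reported experiments.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="slovakbert-address-ner",
    num_train_epochs=15,              # training converged within 10-20 epochs
    learning_rate=3e-5,               # assumed initial learning rate
    lr_scheduler_type="linear",       # linear learning-rate schedule
    weight_decay=0.01,                # weight decay to counter overfitting
    per_device_train_batch_size=32,
)
# A transformers.Trainer would then combine these arguments with the
# token-classification model above and the tokenized, BIO-aligned datasets.
```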

Computing resources of the HPC system Devana, operated by the Computing Centre, Centre of Operations of the Slovak Academy of Sciences, were leveraged for model training, specifically a GPU node with 1 NVIDIA A100 GPU. For more convenient data analysis and debugging, an interactive environment using OpenOnDemand was employed, which gives researchers remote web access to supercomputers.

The training process required only 10-20 epochs to converge for both models. Using the described HPC setting, one training epoch on the 9492 samples in the training dataset took on average 20 seconds for SlovakBERT and 12 seconds for DistilSlovakBERT. Inference on 69 samples takes 0.64 seconds for SlovakBERT and 0.37 seconds for DistilSlovakBERT, which demonstrates the models' efficiency in real-time NLP pipelines.

Iterative Improvements

Although only 69 instances of real data were available, their complexity was quite challenging to imitate in the generated data. The generated dataset was created using several different prompts, resulting in 11,306 sentences that resembled human-generated content. The work consisted of a number of iterations, each of which can be split into the following steps: generate data, train a model, visualize the prediction errors on the real and artificial test datasets, and analyze them. This way we identified patterns that the model failed to recognize, and based on these insights we generated new data that followed the newly identified patterns. The patterns we devised in the various iterations are presented in Table 3. Both of our models were trained on each newly expanded dataset, with SlovakBERT's accuracy always exceeding that of DistilSlovakBERT. Therefore, we decided to use only SlovakBERT as the base model going forward.

Results

The confusion matrix corresponding to the results obtained with the model trained in Iteration 1 (see Table 3) is displayed in Table 4. This model was able to correctly recognize only 67.51% of the entities in the test dataset. A granular examination of the errors revealed that the training dataset did not represent the real-world sentences well enough and that there was a strong need to generate more and better representative data. It is evident from Table 4 that the most common error was the identification of a municipality as a street. We noticed that this occurred when the municipality name appeared before the street name in the address. This led to the data generated in Iteration 2 and Iteration 3.

Table 3: The iterative improvements of data generation. Each prompt was used twice: first with and then without noise, i.e., natural human speech hesitations. Where mentioned, the prompt allowed shuffling or omitting some address parts.

This process of detailed analysis of prediction errors and subsequent data generation accounts for most of the improvements in the accuracy of our model. The goal was to achieve more than 90% accuracy on the test data. The model's predictive accuracy kept increasing with systematic data generation. Eventually, the whole dataset was duplicated, with the duplicated sentences differing only in letter case. (The pre-trained model is case sensitive and some test instances contained street and municipality names in lowercase.) This made the model more robust to the form of its input and led to a final accuracy of 93.06%. The confusion matrix of the final model can be seen in Table 5.
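
A sketch of this case-duplication step is given below, assuming the duplicates were produced by lowercasing the tokens while keeping the BIO tags unchanged; the exact casing transformation used in the report may differ.

```python
def augment_with_lowercase(dataset):
    """dataset: list of (tokens, tags) pairs; returns the case-augmented list."""
    augmented = list(dataset)
    for tokens, tags in dataset:
        augmented.append(([t.lower() for t in tokens], list(tags)))
    return augmented

sample = [(["Bývam", "na", "Mierovej", "7"], ["O", "O", "B-Street", "B-Housenumber"])]
print(augment_with_lowercase(sample))
```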

Table 4: Confusion matrix of the model trained on the dataset from the first iteration, reaching a predictive accuracy of 67.51%.
Table 5: Confusion matrix of the final model, with a predictive accuracy of 93.06%. Compared to the results in Table 4, the accuracy increased by 25.55 percentage points.

There are still some errors; notably, tokens that should have been tagged as outside were occasionally misclassified as municipality. We have opted not to tackle this issue further, as it happens on words that may resemble subparts of our entity names, but, in reality, do not represent entities themselves. See an example below in Table 6.

Table 6: Examples of the final model's predictions for two test sentences. The first sentence contains one incorrectly classified token: the third token "Kal" with ground truth label O was predicted as B-Municipality. The misclassification of "Kal" as a municipality occurred due to its similarity to subwords found in "Kalsa", but ground truth labeling was based on context and the authors' judgment. The second sentence has all its tokens classified correctly.

Conclusions

In this technical report we presented a NER model built upon the pre-trained SlovakBERT language model. The model was trained and validated exclusively on an artificially generated dataset. This representative, high-quality synthetic data was iteratively expanded; together with hyperparameter fine-tuning, this iterative approach allowed us to reach a predictive accuracy exceeding 90% on the real dataset. Since the real dataset contained a mere 69 instances, we used it only for testing. Despite the limited amount of real data, our model exhibits promising performance. This approach underlines the potential of using an exclusively synthetic dataset, especially in cases where the amount of real data is not sufficient for training.

This model can be utilized in real-world applications within NLP pipelines to extract and verify the correctness of addresses transcribed by speech-to-text mechanisms. If a larger real-world dataset becomes available, we recommend retraining the model and possibly also expanding the synthetic dataset with more generated data, as the existing dataset might not represent newly occurring data patterns.
The model is available at https://huggingface.co/nettle-ai/slovakbert-address-ner

Acknowledgement

The research results were obtained with the support of the Slovak National competence centre for HPC, the EuroCC 2 project and Slovak National Supercomputing Centre under grant agreement 101101903-EuroCC 2-DIGITAL-EUROHPC-JU-2022-NCC-01.

AUTHORS

Bibiána Lajčinová – Slovak National Supercomputing Centre

Patrik Valábek – Slovak National Supercomputing Centre; Institute of Information Engineering, Automation, and Mathematics, Slovak University of Technology in Bratislava

Michal Spišiak – nettle, s. r. o.

Full version of the article SK
Full version of the article EN

References

[1] Matús Pikuliak, Stefan Grivalsky, Martin Konopka, Miroslav Blsták, Martin Tamajka, Viktor Bachratý, Marián Simko, Pavol Balázik, Michal Trnka, and Filip Uhlárik. Slovakbert: Slovak masked language model. CoRR, abs/2109.15254, 2021.

[2] Lance Ramshaw and Mitch Marcus. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora, 1995.

[3] Ministerstvo vnútra Slovenskej republiky. Register adries. https://data.gov.sk/dataset/register-adries-register-ulic. Accessed: August 21, 2023.

[4] Ivan Agarský. Hugging face model hub. https://huggingface.co/crabz/distil-slovakbert, 2022. Accessed: September 15, 2023.



Anomaly Detection in Time Series Data: Gambling prevention using Deep Learning


Preventing problem gambling among online casino players is a challenging ambition with positive impacts both on players' well-being and for casino providers aiming for responsible gambling. To facilitate this, we propose an unsupervised deep learning method whose objective is to identify players showing signs of problem gambling based on the available data in the form of time series. We compare our proposed transformer-based autoencoder architecture for anomaly detection with recurrent neural network and convolutional neural network autoencoder architectures and highlight its advantages. Because the players' clinical diagnoses were not part of the data at hand, we evaluated the outcome of our study by analyzing the correlation between the anomaly scores obtained from the autoencoder and several proxy indicators associated with problem gambling reported in the literature.


Prevention of problem or pathological gambling, currently conceptualized as a behavioural pattern where individuals stake an object of value (typically money) on the uncertain prospect of a larger reward [1], [2], is of high societal importance. Research over the past decade has revealed multiple similarities between pathological gambling and substance use disorders [3]. With the high accessibility of the Internet, the incidence of pathological gambling has increased. This disorder can result in significant negative consequences for the affected individuals and their families; detecting early warning signs of problem gambling is therefore crucial for maintaining players' wellbeing. This work is a joint effort of the Slovak National Competence Center for High-Performance Computing, DOXXbet, ltd. – sports betting and online casino – and Codium, ltd. – software developer of the DOXXbet sports betting and iGaming platform – with the goal of enhancing customer service and player engagement via identification and prevention of problematic gambling behaviour. This proof of concept is a foundation for future tools which will help the casino mitigate negative consequences for players, even at the price of lower provision for the provider, in line with European trends in risk management related to problem gambling.

In our study we propose a completely unsupervised deep learning approach using a transformer-based AE architecture to detect anomalies in the dataset, i.e., players with anomalous behaviour. The dataset at hand does not include clinical diagnoses, and only a few of the proxy indicators known from the literature are available: requests to increase spending limits, chasing losses by gambling more (referred to as chasing episodes later in this article), usage of multiple payment methods, frequent withdrawals of small amounts of money, and others mentioned later in the text. Clearly, not all anomalous users necessarily have problem gambling, hence the proxy indicators are used in combination with the AE results, namely the anomaly score. Our approach rests on the idea that a compulsive gambler is an anomaly among active casino players, with the literature putting their fraction among all players between 0.5% and 5% for chance-based games.

Data

The data acquired for this research consist of sequences of data points collected over time, tracking multiple aspects of player behaviour, such as the frequency and timing of gaming activities, the frequency and amount of cash deposits, the payment methods used when depositing cash, and information about bets, wins, losses, withdrawals, and requests to change deposit limits. Feature engineering resulted in 19 features in the form of time series (TS), where each feature consists of multiple time stamps. These features can be classified into three categories – "time", "money" and "despair" – as inspired by Seth et al. [4]. Table 1 summarizes the full set of TS features with a short explanation. Each feature is a sequence of N values, where each value corresponds to one of N consecutive time windows. Each value was produced by aggregating daily data in the respective time window; the window length, and whether the window is sliding, are specified in Table 1. Hence, for each sample we needed a history of N time windows. The feature engineering procedure is displayed in Figure 1 and the final data shape is depicted in Figure 2.

Figure 1: Visualization of the data aggregation from a daily basis into time windows, and eventually into TS features. t1, …, t450 represent time stamps for the daily data x1, …, x450. Daily data points from a time window are aggregated into a single value zi for all i ∈ {1, …, 8}.
Figure 2: Final data shape obtained after feature engineering. Each sample is represented by 19 features consisting of 8 time windows.
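
A minimal sketch of this aggregation step is shown below for two of the 19 features. The column names, the window length and the aggregation function (a sum over each window) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

N_WINDOWS = 8
WINDOW_DAYS = 56                     # assumed (non-sliding) window length in days

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "day": pd.date_range("2022-01-01", periods=N_WINDOWS * WINDOW_DAYS, freq="D"),
    "deposits": rng.poisson(0.3, N_WINDOWS * WINDOW_DAYS),
    "logins": rng.poisson(1.5, N_WINDOWS * WINDOW_DAYS),
})

# Assign each day to a window and aggregate within it -> one value per window.
daily["window"] = np.arange(len(daily)) // WINDOW_DAYS
ts_features = daily.groupby("window")[["deposits", "logins"]].sum()
print(ts_features)                   # 8 rows (time windows) x 2 of the 19 TS features
```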

AE models comparison

An autoencoder (AE) is a "self-supervised" deep learning method suitable for anomaly detection. The idea behind using this type of neural network for anomaly detection is based on the model's reconstruction capability. The AE learns to reconstruct the data in the training set, and since the training set should ideally contain only "normal" observations, the model learns to reconstruct only such observations correctly. Therefore, when the input observation is anomalous, the trained AE model cannot reconstruct it sufficiently well, resulting in a high reconstruction error. This reconstruction error can be used as an anomaly score for the given observation, where a higher score means a higher probability that the observation deviates from the general trend.
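
The anomaly score itself can be computed as the per-sample reconstruction error, as in the minimal sketch below; the array layout follows the data shape in Figure 2, and the error metric is the MSE used later in the results.

```python
import numpy as np

def anomaly_scores(x: np.ndarray, x_reconstructed: np.ndarray) -> np.ndarray:
    """Per-player anomaly score: mean squared reconstruction error.

    x, x_reconstructed: arrays of shape (n_players, n_windows, n_features),
    i.e. 8 time windows x 19 TS features per player (see Figure 2).
    """
    return np.mean((x - x_reconstructed) ** 2, axis=(1, 2))
```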

In our study, we trained an AE model based on transformers, where both the encoder and the decoder contain a "Multi Head Attention" layer with four heads and 32-dimensional key and value vectors. This layer is followed by a classical neural network with so-called "dropout" layers and residual connections. The entire AE model has just over 100k trainable parameters.
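
A sketch of one such transformer block in Keras is given below. Only the attention configuration (4 heads, 32-dimensional keys and values) comes from the description above; the feed-forward width, dropout rate and framework choice are assumptions, and the full encoder-decoder AE is not reproduced.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(inputs, ff_dim=64, dropout=0.1):
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=32, value_dim=32)(inputs, inputs)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization()(inputs + attn)       # residual connection

    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Dense(inputs.shape[-1])(ff)
    return layers.LayerNormalization()(x + ff)           # residual connection

inputs = layers.Input(shape=(8, 19))    # 8 time windows x 19 TS features per player
outputs = transformer_block(inputs)
tf.keras.Model(inputs, outputs).summary()
```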

Reconstruction loss and Prediction ability

We performed a 3-fold cross-validation by splitting the data into training, validation, and test sets, and trained the models for each split to assess their stability. The resulting average loss values and their variances are displayed in Table 3. The average reconstruction error of the transformer model is significantly lower than that of all other models; the LSTM B model comes second in reconstruction performance, and the CNN model performs worst. Generally, the test loss is always higher than the train and validation losses. The reason is that the 211 data points removed from the training set during data cleaning were moved to the test set. Without moving these samples, the test loss for the transformer-based model would be as low as 0.012, for the CNN model 0.33, for the LSTM A model 0.27, and for the LSTM B model 0.13. A more detailed overview of the models' performance is given by the histograms of loss values on the test set (the transformer-based model's histogram is shown in Figure 3). All histograms have a heavy right tail, which is expected for datasets containing anomalies.

Figure 3: Reconstruction error histograms of the transformer-based AE model for the test set. On the x-axis is the value of the anomaly score and on the y-axis is the frequency of the corresponding value.

To demonstrate the quality of the reconstruction, the original (blue line) and predicted (red line) values for a randomly selected anomalous observation of one player are shown in Figure 4. The value of the anomaly score is given in the caption of the graph.

Figure 4: Comparison of the predictive ability of the AE models. All models reconstructed the same observation from the test set. The blue line represents the input data, the red line the reconstruction obtained using the transformer-based AE model. The number shown in the graph header represents the anomaly score for that data sample.

Results

Since clinical diagnoses were not part of the data we had, we can only rely on auxiliary indicators to identify players with potential problem gambling. We approached this task by detecting anomalies in the data, but we are aware that not all anomalies necessarily indicate a gambling problem. Therefore, we correlate the results of the AE model with the following auxiliary indicators:

  • Mean number of logins in a time window.
  • Mean number of withdrawals in a time window.
  • Mean number of small and frequent withdrawals in a time window.
  • Mean number of requests for the change of the deposit limit in a time window.
  • Sum of chasing episodes over the whole period of N time windows.

Figure 5 depicts the correlation of the anomaly score with the proxy indicators. Each subplot contains 10 bars, each bar representing one decile of the data samples (i.e. each bar represents 10% of data samples sorted by anomaly score). The bar colors represent the category value of the respective proxy indicator.

Figure 5: Each bar in the graphs represents one decile of the anomaly score (MSE). The colors represent the categories of the relevant auxiliary indicators, with category values specified in the legend.

A distinctive pattern in players' behavior can be observed: players with larger anomaly scores tend to exhibit high values for all the evaluated indicators. A higher frequency of logins goes hand in hand with a higher anomaly score, with more than half of the players in the last decile of the reconstruction error having a mean number of logins per time window greater than 50. The same applies to the mean number of cash withdrawals per time window: players with a low anomaly score have almost no withdrawals, whilst more than one fourth of the players in the last anomaly-score decile have two or more withdrawals per time window on average. Another secondary indicator we utilize is the number of small and frequent withdrawals; most of the players with at least one such event are in the 10% of players with the highest MSE. For the number of requests for a deposit limit change we observe a more subtle pattern: players in the first five deciles generally have no requests for a limit change (with very few exceptions), while as the anomaly score increases, the frequency of limit change requests also tends to rise. The last proxy indicator depicted is the number of chasing episodes, whose frequency rises with the anomaly score; more than half of the players in the last decile have at least one chasing episode in a time window.

If these plots are overlapped in order to identify the portion of players fulfilling multiple proxy indicators, the following observations result: in the last five percentiles of the anomaly score, 98.6% of players satisfy at least one proxy indicator and 77.3% satisfy at least three. As for the last two percentiles, i.e. the 2% of players with the highest reconstruction error, almost 90% of them satisfy at least three indicators. The thresholds used to calculate these proportions are >= 1 chasing episode, >= 1 limit change, >= 1 small and frequent withdrawal, >= 31 logins, and >= 1.25 withdrawals on average per time window.
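
This overlap analysis can be computed roughly as in the sketch below; the DataFrame column names are assumed names for the aggregated per-player statistics, while the thresholds are the ones listed above.

```python
import pandas as pd

def indicator_overlap(df: pd.DataFrame, top_fraction: float = 0.05) -> pd.Series:
    """Share of top-scoring players meeting at least one / three proxy indicators."""
    flags = pd.DataFrame({
        "chasing":       df["chasing_episodes"] >= 1,
        "limit_change":  df["limit_changes"] >= 1,
        "small_withdr":  df["small_frequent_withdrawals"] >= 1,
        "logins":        df["mean_logins_per_window"] >= 31,
        "withdrawals":   df["mean_withdrawals_per_window"] >= 1.25,
    })
    n_met = flags.sum(axis=1)
    top = df["anomaly_score"].rank(pct=True) >= 1 - top_fraction
    return pd.Series({
        "at_least_one":   (n_met[top] >= 1).mean(),
        "at_least_three": (n_met[top] >= 3).mean(),
    })
```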

Conclusion

In this work, we successfully applied a transformer-based autoencoder (AE) to detect anomalies in a dataset of online casino players. The aim was to detect problem gamblers in the dataset at hand in an unsupervised manner. 19 features were derived from the raw time series (TS) data, reflecting players' behavior in the context of time, money and despair. We compared the performance of this architecture with three other AE architectures based on LSTM and convolutional layers and found that the transformer-based AE achieved the best reconstruction capability among the four models. Its anomaly score also correlates strongly with proxy indicators such as the number of logins, the number of withdrawals, the number of chasing episodes and others that are commonly mentioned in the literature in relation to gambling disorder. This alignment of the AE's anomaly score with the proxy indicators gives us insight into the effectiveness of the predictions in identifying players with potential problem gambling. Even though these proxy indicators were also used as predictors, we suggest using them as a secondary check when detecting players with potential problem gambling, in order to avoid false positives, as not all anomalies must be linked to gambling disorder. Our findings demonstrate the potential of transformer-based AEs for unsupervised anomaly detection in TS data, particularly in the context of online casino player behavior.

Full version of the article

References

[1] Alex Blaszczynski and Lia Nower. “A Pathways Model of Problem and Pathological Gambling”. In: Addiction (Abingdon, England) 97 (June 2002), pp. 487–99. doi: 10.1046/j.1360-0443.2002.00015.x.

[2] National Research Council. Pathological Gambling: A Critical Review. Washington, DC: The National Academies Press, 1999. isbn: 978-0-309-06571-9. doi: 10.17226/6329. url: https://nap.nationalacademies.org/catalog/6329/pathological-gambling-a-critical-review.

[3] Luke Clark et al. “Pathological Choice: The Neuroscience of Gambling and Gambling Addiction”. In: Journal of Neuroscience 33.45 (2013), pp. 17617–17623. issn: 0270-6474. doi: 10.1523/JNEUROSCI.3231-13.2013. eprint: https://www.jneurosci.org/content/33/45/17617.full.pdf. url: https://www.jneurosci.org/content/33/45/17617.

[4] Deepanshi Seth et al. “A Deep Learning Framework for Ensuring Responsible Play in Skill-based Cash Gaming”. In: 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA) (2020), pp. 454–459.



Measurement of microcapsule structural parameters using artificial intelligence (AI) and machine learning (ML)


The main aim of the collaboration between the National Competence Centre for HPC (NCC HPC) and the Institute of Polymers of SAV (IP SAV) was the design and implementation of a pilot software solution for automatic processing of polymer microcapsule images using an artificial intelligence (AI) and machine learning (ML) approach. The microcapsules consist of a semi-permeable polymeric membrane which was developed at the IP SAV.


Automatic image processing has several benefits for IP SAV. It will save time, since manual measurement of microcapsule structural parameters is time-consuming due to the huge number of images produced during the process. In addition, automatic image processing will minimize the errors which are inevitably connected with manual measurements. The optical microscope images obtained at 4.0 zoom usually contain one or more microcapsules, and they represent the input for the AI/ML process. The optical microscope images obtained at 2.5 zoom, on the other hand, usually contain three to seven microcapsules, so detection of each particular microcapsule is essential.

The images from the optical microscope are processed in two steps. The first is localization and detection of the microcapsules; the second consists of a series of operations leading to the structural parameters of the microcapsules.

Microcapsule detection

A YOLOv5 model [1] with pre-trained weights from the COCO128 dataset [2] was employed for microcapsule detection. The training set consisted of 96 images, which were manually annotated using the graphical image annotation tool LabelImg [3]. A training unit consisted of 300 epochs, the images were subdivided into 6 batches of 16 images, and the image size was set to 640 pixels. The computational time of one training unit on an NVIDIA GeForce GTX 1650 GPU was approximately 3.5 hours.

The detection using the trained YOLOv5 model is presented in Figure 1. The reliability of the trained model, verified on 12 images, was 96%, with the throughput on the same graphics card being approximately 40 frames per second.

Figure 1: (a) microcapsule image from optical microscope (b) detected microcapsule (c) cropped detected microcapsule for 4.0 zoom, (d) microcapsule image from optical microscope (e) detected microcapsule (f) cropped detected microcapsule for 2.5 zoom.
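
Running the trained detector on new microscope images could look like the sketch below, which loads the custom weights through torch.hub; the weight path, confidence threshold and image file name are assumptions.

```python
import torch

# Load the custom-trained YOLOv5 checkpoint via torch.hub (pulls the ultralytics/yolov5 repo).
model = torch.hub.load("ultralytics/yolov5", "custom", path="runs/train/exp/weights/best.pt")
model.conf = 0.5                              # assumed confidence threshold for detections

results = model("microcapsule_image.png")     # microscope image (4.0 or 2.5 zoom)
boxes = results.xyxy[0]                       # tensor: x1, y1, x2, y2, confidence, class
print(boxes)
# Each detected box can then be cropped and passed to the U-Net step described below.
```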

Measurement of microcapsule structural parameters using AI/ML

The binary masks of the inner and outer membranes of the microcapsules are created individually, as the output of a deep-learning neural network with the U-Net architecture [4]. This neural network was developed for image processing in biomedical applications. The first training set for the U-Net network consisted of 140 images obtained at 4.0 zoom with the corresponding masks, and the second set consisted of 140 images obtained at 2.5 zoom with the corresponding masks. A training unit consisted of 200 epochs, the images were subdivided into 7 batches of 20 images, and the image size was set to 1280 pixels (4.0 zoom) or 640 pixels (2.5 zoom). 10% of the images were used for validation. The reliability of the trained model, verified on 20 images, exceeded 96%. The training process lasted less than 2 hours on an HPC system with IBM Power 7 type nodes, and it had to be repeated several times. The obtained binary masks were subsequently post-processed using fill-holes [5] and watershed [6] operations to get rid of unwanted residues. Subsequently, the binary masks were fitted with an ellipse using the scikit-image measure library [7]. The first and second principal axes of the fitted ellipse are used for the calculation of the microcapsule structural parameters. An example of inner and outer binary masks and the fitted ellipses is shown in Figure 2.

Figure 2: (a) input image from optical microscope (b) inner binary mask (c) outer binary mask (d) output image with fitted ellipses.
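
A minimal approximation of this post-processing chain, using the cited SciPy and scikit-image routines, is sketched below; it is not the exact pipeline, and the watershed marker choice in particular is an assumption.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import label, regionprops
from skimage.segmentation import watershed

def ellipse_axes(mask: np.ndarray):
    """mask: 2-D boolean array (thresholded U-Net output for one membrane)."""
    filled = ndi.binary_fill_holes(mask)                           # fill-holes operation [5]
    distance = ndi.distance_transform_edt(filled)
    ws = watershed(-distance, markers=label(filled), mask=filled)  # watershed operation [6]
    regions = regionprops(label(ws > 0))                           # scikit-image measure [7]
    capsule = max(regions, key=lambda r: r.area)                   # keep the largest region
    # First and second principal axes of the ellipse fitted to the region:
    return capsule.major_axis_length, capsule.minor_axis_length
```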

Structural parameters obtained by our AI/ML approach (denoted "U-Net") were compared to those obtained by manual measurements performed at the IP SAV. A different model (denoted "Retinex") was used as another independent source of reference data. The Retinex approach was implemented by RNDr. Andrej Lúčny, PhD., from the Department of Applied Informatics of the Faculty of Mathematics, Physics and Informatics in Bratislava. This approach is not based on AI/ML; the ellipse fitting is performed by aggregating line elements with low curvature using a so-called retinex filter [8]. The Retinex approach is a good reference due to its relatively high precision, but it is not fully automatic, especially for the inner membrane of the microcapsule.

Figure 3 summarizes a comparison between the three approaches (U-Net, Retinex, IP SAV manual measurement) for obtaining the microcapsule structural parameters at 4.0 zoom.


Figure 3: (a) microcapsule diameter for different batches, (b) difference between the diameters of the fitted ellipse (first principal axis) and the microcapsule, (c) difference between the diameters of the fitted ellipse (second principal axis) and the microcapsule. Red lines in (b) and (c) represent the threshold given by IP SAV. The images were obtained using 4.0 zoom.

All obtained results, except 4 images of batch 194 (ca 1.5%), are within the threshold defined by the IP SAV. As can be seen from Figure 3(a), the microcapsule diameters calculated using U-Net and Retinex are in good agreement with each other. The U-Net model performance can be significantly improved in the future, either by expanding the training set or by additional post-processing. The agreement between the manual measurement and U-Net/Retinex may be further improved by unifying the method of obtaining microcapsule structural parameters from the binary masks.

The AI/ML model will be available as a cloud solution on the HPC systems of CSČ SAV, so no additional investment into the HPC infrastructure of IP SAV will be necessary. The production phase, which goes beyond the scope of the pilot solution, involves integration of this approach into a desktop application.

References

[1] https://github.com/ultralytics/yolov5

[2] https://www.kaggle.com/ultralytics/coco128

[3] https://github.com/heartexlabs/labelImg

[4] https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

[5] https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.binary_fill_holes.html

[6] https://scikit-image.org/docs/stable/auto_examples/segmentation/plot_watershed.html

[7] https://scikit-image.org/docs/stable/api/skimage.measure.html

[8] D.J. Jobson, Z. Rahman, G.A. Woodell, IEEE Transactions on Image Processing 6 (7) 965-976, 1997.



Use case: Transfer and optimization of CFD calculations workflow in HPC environment


Authors: Ján Škoviera (National competence centre for HPC), Sylvain Suzan (Shark Aero)

The Shark Aero company designs and manufactures ultralight sport aircraft with a two-seat tandem cockpit. For design development they use the popular open-source software package OpenFOAM [1]. The CFD (Computational Fluid Dynamics) simulations use the Finite Volume Method (FVM). After the model is created using Computer-Aided Design (CAD) software, it is divided into discrete cells, a so-called "mesh". The simulation accuracy depends strongly on the mesh density, with the computational and memory requirements rising with the third power of the number of mesh vertices. For some simulations the computational demands can be a limiting factor. The workflow was therefore transferred into a High-Performance Computing (HPC) environment, with a special focus on investigating the parallelization efficiency of the computational tasks for a given model type.

METHODS

Compute nodes with 2×6-core Intel Xeon L5640 @ 2.27 GHz, 48 GB RAM, and 2×500 GB disks were used for this project. All calculations were done in a standard HPC environment using the Slurm job scheduling system. This is an acceptable solution for this type of workload, where neither a real-time response nor immediate data processing is required. For the CFD simulations we continued to use the OpenFOAM and ParaView version 9 software packages. A Singularity container was used for deployment of the calculations, with a potential transfer of the workload to another HPC system in mind. The speed-up gained from a straightforward transfer to the HPC system was approximately 1.5× compared to a standard laptop.

PARALLELIZATION

Parallelized task execution can increase the speed of the overall calculation by utilizing more computing units concurrently. In order to parallelize the task, one needs to divide the original mesh into domains – parts that will be processed concurrently. The domains, however, need to communicate through the processor boundaries, i.e. the domain sides where the original enclosing mesh was divided. The larger the processor boundary surface, the more I/O is required to resolve the boundary conditions. Processor boundary communication is facilitated by the distributed-memory Message Passing Interface (MPI) protocol, and the distinction between CPU cores and different compute nodes is abstracted away from the user. This places certain limits on the efficient use of many parallel processes, since overly parallelized job executions can actually be slower due to communication and I/O bottlenecks. The domains should therefore be created in a way that minimizes the processor boundaries. One possible strategy is to divide the original mesh only in the direction co-planar with the smallest side of the original enclosing mesh. Careless division into domains increases the amount of data to be transferred beyond reasonable measure, and dividing the mesh along multiple axes also creates more processor boundaries.

Figure 1: Illustration of mesh segmentation. The enclosing mesh is represented by the transparent boxes.

The calculations were done in four steps: enclosing mesh creation, mesh segmentation, model inclusion, and the CFD simulation. The enclosing mesh creation was done using the blockMesh utility, the mesh segmentation step using the decomposePar utility, the model inclusion using the snappyHexMesh program, and the CFD simulation itself using simpleFoam. The most computationally demanding step is snappyHexMesh. This is understandable: while in the CFD simulation the calculation needs to be done several times for every edge of the mesh and every iteration, in the model inclusion step new vertices are created and old ones deleted based on the positions of vertices in the model mesh. This requires creating an "octree" (a partitioning of three-dimensional space by recursively subdividing it into eight octants), repeated inverse search, and octree re-balancing. Each of these processes is N·log(N) in the best case and N² in the worst case, N being the number of vertices. The CFD itself scales linearly with the number of edges, i.e. "close to" linearly with N (only spatially proximate nodes are interconnected). We developed a workflow that creates a number of domains split directly by yz planes (x being the axis of the aircraft nose), which simplifies the decision making; a sketch of this workflow is shown below. After inclusion of a new model, one can simply specify the number of domains and run the calculation, minimizing the human intervention needed to parallelize it.
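
The four steps could be chained roughly as in the sketch below, with the enclosing mesh split only along the x axis (yz-plane cuts). The decomposeParDict content, the case layout and the domain count are assumptions, and keyword names may differ slightly between OpenFOAM versions.

```python
import subprocess
from pathlib import Path

N_DOMAINS = 8
case = Path("aircraftCase")            # assumed OpenFOAM case directory

decompose_dict = f"""\
FoamFile
{{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}}
numberOfSubdomains {N_DOMAINS};
method          simple;
simpleCoeffs
{{
    n               ({N_DOMAINS} 1 1);   // cut only along x, i.e. yz-plane slices
    delta           0.001;
}}
"""
(case / "system" / "decomposeParDict").write_text(decompose_dict)

for cmd in (
    ["blockMesh"],                                                   # enclosing mesh creation
    ["decomposePar", "-force"],                                      # mesh segmentation
    ["mpirun", "-np", str(N_DOMAINS),
     "snappyHexMesh", "-overwrite", "-parallel"],                    # model inclusion
    ["mpirun", "-np", str(N_DOMAINS), "simpleFoam", "-parallel"],    # CFD simulation
):
    subprocess.run(cmd, cwd=case, check=True)
```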

RESULTS AND CONCLUSION

The relative speed-up of the calculation is mainly determined by I/O limits. If the computational tasks are well below the I/O bound, the computation time is inversely proportional to the number of domains. In less demanding calculations, i.e. for small models, the processes can easily be over-parallelized.

Figure 2: Dependence of the real elapsed time on the number of processes for snappyHexMesh and simpleFoam. In the case of simpleFoam the time starts to diverge for more than 8 processes, since the data traffic overcomes the parallelization advantage. Ideal scaling shows the theoretical time needed to finish the calculation if data traffic and processor boundary condition resolution were not involved.

Once the mesh density is high enough, the time to calculate the CFD step is also inversely proportional to the number of parallel processes. As shown in the second pair of figures, with a twofold increase in mesh density the calculations are below the I/O bound even in the CFD step. Even though the CFD step is in this case fast compared to the meshing process, simulating long time intervals could make it the most time-consuming step.

Aircraft part design requires simulating relatively small models multiple times under varying conditions. The mesh density needed for these simulations falls into the medium category. When transferring the calculations to the HPC environment, we had to take into account the real needs of the end user in terms of model size, mesh density, and required precision of the results. There are several advantages of using HPC:

  • The end user is relieved of the need to maintain his own computational capacities.
  • Even when restricted to single-thread jobs, the simulations can be offloaded to the HPC with a high speed-up, making even very demanding and precise calculations feasible.
  • For even more efficient calculations, a simple way of utilizing parallelization was determined for this particular workload, and the limitations of parallel runs for the given use case and conditions were identified. The total speed-up reached in practical conditions is 7.3×. The speed-up generally grows with the calculation complexity and the mesh precision.



MEMO98


MEMO98 is a non-profit non-governmental organisation that has been monitoring the media in the context of elections and other events for more than 20 years and has carried out its activities in more than 50 countries. Recently, the organisation has also been dealing with the impact of social media on the integrity of electoral processes.

The information environment has significantly changed in recent years, especially due to the advent of social media. Apart from some positive aspects, such as the enhanced possibilities of receiving and sharing information, social media has also enabled the dissemination of misinformation to a wide audience quickly and at low cost. MEMO98 analysed the election campaign of the parliamentary elections held on July 11, 2021 in Moldova on five social media platforms: Facebook, Instagram, Odnoklassniki, Telegram and YouTube.

Social media data was collected using CrowdTangle (a Facebook-owned social media analysis tool). The number of post interactions of candidates and individual political parties on Facebook alone was 1.82 million; the number of post interactions of party chairmen climbed to 1.09 million. Prior to the start of this project, MEMO98 had no experience with tools for big data processing and analysis. NCC experts helped design a solution for data processing and visualization utilizing the freely available software Gephi [1] in the HPC environment. The output is a so-called network map, an interactive scheme for finding and analysing the dissemination of specific terms and web addresses in the context of the election campaign. As part of the project, NCC also provided access to computing resources for testing the solution, as well as individual training, so that MEMO98 can work independently with this solution in the HPC environment in the future.

Preliminary results and conclusions of the monitoring are published by MEMO98 on its website [2].

References


[1] Bastian M., Heymann S., Jacomy M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

[2] Network mapping, Moldova Early Parliamentary Elections July 2021, Monitoring of Social Media – Preliminary Findings. Available here:

https://memo98.sk/article/moldovan-social-media-reflected-a-division-in-society

https://memo98.sk/uploads/content_galleries/source/memo/moldova/2021/preliminary-findings-on-the-monitoring-of-parliamentary-elections-2021-on-social-media.pdf