Impact of log file processing on learning speed and defect classification accuracy

Kaiafiuk, A.

doi:10.30857/2786-5371.2025.2.2

Please use this identifier to cite or link to this item: https://er.knutd.edu.ua/handle/123456789/33632

Title:	Impact of log file processing on learning speed and defect classification accuracy
Other Titles:	Вплив обробки лог-файлів на швидкість навчання та точність класифікації дефектів
Authors:	Kaiafiuk, A.
Keywords:	regular expressions; lemmatisation; vectorisation; machine learning; test automation
Issue Date:	2025
Citation:	Kaiafiuk А. Impact of log file processing on learning speed and defect classification accuracy = Вплив обробки лог-файлів на швидкість навчання та точність класифікації дефектів [Текст] / А. Kaiafiuk // Технології та інжиніринг. - 2025. - № 2 (26). - С. 27-36.
Source:	Технології та інжиніринг
Abstract:	The purpose of the study was to investigate how preprocessing the log files of automated tests affects the speed of vectorisation and of training machine learning models. The HDFS_v3_TraceBench dataset was used, which contains more than 370 thousand traces collected in a Hadoop Distributed File System environment. Preprocessing included noise removal, lemmatisation, and reduction of duplicates. The data were vectorised using the term frequency-inverse document frequency (TF-IDF) method, after which a RandomForestClassifier model was trained. The experiments showed that optimising the input data reduced the total processing time almost fivefold. The time required for text vectorisation and model training decreased, which speeds up work with large volumes of logs. At the same time, classification accuracy was not only preserved but improved slightly: the F1-score and the Matthews correlation coefficient remained consistently high. The Log Loss value also decreased, indicating that the model became more confident in its predictions. This is especially important for the imbalanced classes typical of defect classification problems. A detailed analysis showed that a significant part of the service and repetitive information in the logs is not critical for training the model, and that removing it improves the quality of data preparation. The study also confirmed that the target labels obtained for the logs correspond to typical error classes. The implemented log file processing not only reduces computational costs but also maintains or improves prediction quality. These results confirm the feasibility of including a log cleaning and optimisation step in the overall process of building machine learning models for automated testing. The results can be integrated into automated pipelines for classifying defects and generating bug reports, which will help reduce manual labour and increase team efficiency.
DOI:	10.30857/2786-5371.2025.2.2
URI:	https://er.knutd.edu.ua/handle/123456789/33632
ISSN:	2786-538X
Appears in Collections:	Scientific publications (articles), Технології та інжиніринг

Files in This Item:

File	Description	Size	Format
TI_2025_N2(26)_P027-036.pdf		1.8 MB	Adobe PDF
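
The abstract names noise removal with regular expressions and reduction of duplicates as preprocessing steps. The sketch below illustrates only those two steps, using the Python standard library. The patterns, placeholder tokens, function names, and sample log lines are all hypothetical and are not taken from the paper or from the HDFS_v3_TraceBench dataset. Lemmatisation, TF-IDF vectorisation, and model training are omitted.

```python
import re

# Hypothetical noise patterns; the regular expressions actually used in the
# paper are not given in this record.
NOISE_PATTERNS = [
    # Timestamps such as "2025-01-01 12:00:01,123" are dropped entirely.
    (re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?"), " "),
    # HDFS-style block identifiers are collapsed to one placeholder token.
    (re.compile(r"\bblk_-?\d+\b"), "<blk>"),
    # IPv4 addresses with an optional port.
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<ip>"),
    # Any remaining bare numbers.
    (re.compile(r"\b\d+\b"), "<num>"),
]


def clean_line(line: str) -> str:
    """Replace variable fields with placeholders, lowercase, squeeze spaces."""
    for pattern, replacement in NOISE_PATTERNS:
        line = pattern.sub(replacement, line)
    return " ".join(line.lower().split())


def preprocess(lines):
    """Clean every line and keep only the first occurrence of each result."""
    seen = set()
    result = []
    for line in lines:
        cleaned = clean_line(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    raw = [
        "2025-01-01 12:00:01,123 INFO Receiving block blk_123 src: 10.0.0.1:50010",
        "2025-01-01 12:00:02,456 INFO Receiving block blk_456 src: 10.0.0.2:50010",
        "2025-01-01 12:00:03,789 ERROR Exception in receiveBlock for block blk_-789",
    ]
    for line in preprocess(raw):
        print(line)
```

Once the variable fields are masked, the first two sample lines become identical and collapse into one. This mirrors the idea in the abstract that removing repetitive information shrinks the input that later stages have to vectorise.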