Optimization for threat classification of various data types-based on ML model and LLM

Chaerim Hong; Taeyeon Oh

doi:10.1038/s41598-025-05182-y

Sci Rep. 2025 Jul 2;15(1):22768. doi: 10.1038/s41598-025-05182-y.

Authors

Chaerim Hong¹, Taeyeon Oh²

Affiliations

¹ Seoul AI School, aSSIST University, Seoul, 03767, South Korea.
² Seoul AI School, aSSIST University, Seoul, 03767, South Korea. tyoh@assist.ac.kr.

Abstract

As AI technology advances, the number of cyber security threats that exploit it is increasing rapidly, and an effective security threat detection system is urgently needed. AI-based security tools that detect and respond to these threats are an active area of research. This study explores how heterogeneous data, such as signs of security attacks in security threat news and weaknesses in source code, can be analyzed together in an ML model and LLM environment. We applied scaling and normalization techniques to the Post News data to reduce bias, and we combined syntax analysis, semantic analysis, and data flow information in an integrated analysis of the source code to improve detection performance. Data labeling and data formats were systematized so that the approach can be applied to both ML models and LLMs. The resulting models performed well in both text analysis and source code analysis. On the Post News data, the ML-based models XGBoost, SVM, and Random Forest all showed F1-scores of 0.96 to 0.97, while the LLM-based models ST5-xxl, XLNet, BERT, CodeBERT, and GraphCodeBERT all showed a score of 0.97. On the C/C++ weakness code detection data, the LLM-based models ST5-xxl, XLNet, CodeBERT, and GraphCodeBERT each achieved 0.9999, and BERT achieved 0.9037. Among the ML-based models, XGBoost showed an accuracy of 0.9999, SVM showed 0.9699, and Random Forest showed 0.9493, all with the TF-IDF embedding method. The models performed better with the TF-IDF embedding method than with the Word2Vec embedding. This study proposed an integrated ML and LLM framework that could effectively detect source code vulnerabilities using abstract syntax trees (AST).
This framework overcame the limitations of existing static analysis tools and improved detection accuracy by considering the structural characteristics and the semantic context of the code at the same time. In particular, by combining AST-based feature extraction with the natural language understanding capabilities of LLMs, it improved generalization performance for new types of vulnerabilities and significantly reduced false positives.
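
To illustrate the TF-IDF embedding step that the abstract reports for the ML-based models, the following is a minimal sketch in plain Python. It is not the authors' implementation: the tokenizer, the smoothed IDF variant, the L2 normalization, and the names `tokenize`, `tfidf_vectors`, and `samples` are all assumptions made for illustration, and the abstract does not state which TF-IDF variant or preprocessing was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric/underscore runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with L2-normalized TF-IDF weights.

    Assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1,
    which is one common choice; the source does not specify the variant.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


# Hypothetical toy inputs: one news-style sentence and two C snippets.
samples = [
    "attackers exploit buffer overflow in router firmware",
    "char buf[8]; strcpy(buf, user_input);",
    "char buf[8]; strncpy(buf, user_input, sizeof(buf) - 1);",
]
vocabulary, vectors = tfidf_vectors(samples)
```

Each resulting vector could then be passed to a classifier such as XGBoost, SVM, or Random Forest. Terms that appear in fewer documents (here `strcpy`) receive a higher weight than terms shared across snippets (here `char`), which is the property that lets a rare, risky call stand out from common boilerplate.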

Keywords: Code weakness; Data bias; Large language models; Machine learning; Post news; Security weakness.