publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2025
- Revisiting Concept Drift in Windows Malware Detection: Adaptation to Real Drifted Malware with Minimal Samples. Adrian Shuai Li, Arun Iyengar, Ashish Kundu, and Elisa Bertino. NDSS Symposium, 2025.
In applying deep learning for malware classification, it is crucial to account for the prevalence of malware evolution, which can cause trained classifiers to fail on drifted malware. Existing solutions to address concept drift use active learning. They select new samples for analysts to label and then retrain the classifier with the new labels. Our key finding is that the current retraining techniques do not achieve optimal results. These techniques overlook that updating the model with scarce drifted samples requires learning features that remain consistent across pre-drift and post-drift data. The model should thus be able to disregard specific features that, while beneficial for the classification of pre-drift data, are absent in post-drift data, thereby preventing prediction degradation. In this paper, we propose a new technique for detecting and classifying drifted malware that learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation. We compare it with existing model retraining methods in active learning-based malware detection systems and other domain adaptation techniques from the vision domain. Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies in 2024. We also tested our approach in predicting multiple malware families that drifted over time. A thorough evaluation shows that our approach outperforms the state-of-the-art approaches.
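The abstract describes the architecture only at a high level. Below is a minimal sketch of the generic domain-adversarial training pattern it builds on: a gradient reversal layer between a shared feature extractor and a domain discriminator. The feature extractor here is a plain stand-in for the paper's graph neural network over control flow graphs, and all layer sizes and names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of domain-adversarial training with a gradient reversal layer (GRL).
# The MLP feature extractor is a placeholder for a GNN encoder over control flow graphs.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the feature extractor,
        # so it learns features the domain discriminator cannot separate.
        return -ctx.lamb * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # stand-in for a GNN encoder
label_classifier = nn.Sequential(nn.Linear(64, 2))                # malware vs. benign
domain_classifier = nn.Sequential(nn.Linear(64, 2))               # pre-drift vs. post-drift

def training_step(x_src, y_src, x_tgt, lamb=1.0):
    """One adversarial step on labeled pre-drift (source) data and scarce post-drift (target) data."""
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)
    cls_loss = nn.functional.cross_entropy(label_classifier(f_src), y_src)

    feats = torch.cat([f_src, f_tgt])
    domains = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    dom_loss = nn.functional.cross_entropy(
        domain_classifier(GradReverse.apply(feats, lamb)), domains)
    return cls_loss + dom_loss
```

The reversed gradient pushes the feature extractor toward representations that the domain discriminator cannot tell apart, which is the sense in which the learned features are drift-invariant.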
2024
- Transfer Learning for Security: Challenges and Future Directions. Adrian Shuai Li, Arun Iyengar, Ashish Kundu, and Elisa Bertino. arXiv preprint arXiv:2403.00935, 2024.
Many machine learning and data mining algorithms rely on the assumption that the training and testing data share the same feature space and distribution. However, this assumption may not always hold. For instance, there are situations where we need to classify data in one domain, but we only have sufficient training data available from a different domain. The latter data may follow a distinct distribution. In such cases, successfully transferring knowledge across domains can significantly improve learning performance and reduce the need for extensive data labeling efforts. Transfer learning (TL) has thus emerged as a promising framework to tackle this challenge, particularly in security-related tasks. This paper reviews the current advancements in utilizing TL techniques for security. It includes a discussion of the existing research gaps in applying TL in the security domain, as well as an exploration of potential future research directions and issues that arise in the context of TL-assisted security solutions.
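For readers unfamiliar with the setting, the following is a minimal sketch of the basic transfer-learning pattern the survey covers: reuse feature layers trained on a plentiful source domain and fine-tune only a small head on the scarce labeled target data. The architecture, shapes, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of fine-tuning-based transfer learning:
# a backbone trained on the source domain is frozen, and only a small
# task head is re-trained on the limited labeled target data.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(256, 64), nn.ReLU())  # assumed already trained on source data
head = nn.Linear(64, 2)                                  # re-trained for the target task

for p in backbone.parameters():
    p.requires_grad = False                              # keep source knowledge fixed

opt = torch.optim.Adam(head.parameters(), lr=1e-3)

def finetune_step(x_tgt, y_tgt):
    """One fine-tuning step on a small labeled target batch."""
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(backbone(x_tgt)), y_tgt)
    loss.backward()
    opt.step()
    return loss.item()
```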
- Overcoming the lack of labeled data: Training malware detection models using adversarial domain adaptation. Sonam Bhardwaj, Adrian Shuai Li, Mayank Dave, and Elisa Bertino. Computers & Security, 2024.
Many current malware detection methods are based on supervised learning techniques, which, however, have certain limitations. First, these techniques require a large amount of labeled data for training, which is often difficult to obtain. Second, they are not very effective when there are differences in domain distribution between new malware and known malware. To address these issues, we propose MD-ADA – a malware detection framework that leverages adversarial domain adaptation (DA). DA allows one to adapt a training malware dataset available at a domain, referred to as the source, for training a classifier in another domain, referred to as the target. DA, typically used when the target has limited training malware data available, maps the source and target datasets into a common latent space. As we use an image representation for malware binaries, MD-ADA uses a convolutional neural network (CNN) providing a lossless image embedding for the source and target datasets. MD-ADA also employs a generative adversarial network (GAN) for malware classification that is suitable for scenarios with few labeled target samples where the feature distributions are similar (homogeneous) or different (heterogeneous). We have carried out several experiments to assess the performance of MD-ADA. The experiments show that MD-ADA outperforms the fine-tuning approach with an accuracy of 99.29% on the BODMAS dataset, 89.3% for the Malevis dataset on homogeneous feature distribution, and 90.12% on the CICMalMem2022 dataset (Target) and 83.23% on the Microsoft Kaggle dataset (Target) for heterogeneous feature distribution. The observed F1-scores of 99.13% and 87.5% for homogeneous feature distributions and 91.27% and 81.7% for heterogeneous distributions indicate that MD-ADA's performance is satisfactory for both data distributions when the target has very few labeled samples.
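The image representation mentioned above follows the common "malware binary as grayscale image" idea. The sketch below illustrates that general technique only; it is not MD-ADA's lossless embedding or its GAN component, and the image width and zero-padding are assumptions.

```python
# Minimal sketch: interpret a malware binary's raw bytes as a 2-D grayscale image
# that can be fed to a CNN encoder.
import numpy as np

def bytes_to_image(raw: bytes, width: int = 256) -> np.ndarray:
    """Reshape a binary's byte values (0-255) into a rows x width array."""
    data = np.frombuffer(raw, dtype=np.uint8)
    rows = int(np.ceil(len(data) / width))
    padded = np.zeros(rows * width, dtype=np.uint8)   # pad the last row with zeros
    padded[:len(data)] = data
    return padded.reshape(rows, width)

# Example usage (hypothetical file name):
# img = bytes_to_image(open("sample.exe", "rb").read())
```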
2023
- Building Manufacturing Deep Learning Models with Minimal and Imbalanced Training Data Using Domain Adaptation and Data Augmentation. Adrian Shuai Li, Elisa Bertino, Rih-Teng Wu, and Ting-Yan Wu. In 2023 IEEE International Conference on Industrial Technology (ICIT), 2023.
Deep learning (DL) techniques are highly effective for defect detection from images. Training DL classification models, however, requires vast amounts of labeled data, which is often expensive to collect. In many cases, not only is the available training data limited, but it may also be imbalanced. In this paper, we propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task by transferring knowledge gained from an existing source dataset used for a similar learning task. Our approach works for scenarios where the source dataset and the dataset available for the target learning task have the same or different feature spaces. We combine our DA approach with an autoencoder-based data augmentation approach to address the problem of imbalanced target datasets. We evaluate our combined approach using image data for wafer defect prediction. The experiments show its superior performance against other algorithms when the number of labeled samples in the target dataset is very small and the target dataset is imbalanced.
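As a rough illustration of the autoencoder-based augmentation step, the sketch below perturbs latent codes of minority-class samples to synthesize new ones. The network sizes, noise scale, and sampling strategy are assumptions, not the paper's configuration.

```python
# Minimal sketch of autoencoder-based augmentation for an imbalanced class:
# encode minority-class samples, perturb their latent codes, and decode
# synthetic samples. The autoencoder is assumed to be already trained on the
# target data before this function is used.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

def augment_minority(x_minority: torch.Tensor, n_new: int, noise: float = 0.1) -> torch.Tensor:
    """Generate n_new synthetic samples from minority-class inputs."""
    with torch.no_grad():
        z = encoder(x_minority)
        idx = torch.randint(0, len(z), (n_new,))           # resample existing latent codes
        z_new = z[idx] + noise * torch.randn(n_new, z.shape[1])
        return decoder(z_new)                              # decoded samples augment the minority class
```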
- Machine Learning Techniques for Cybersecurity. Elisa Bertino, Sonam Bhardwaj, Fabrizio Cicala, Sishuai Gong, and 5 more authors. 2023.
This book explores machine learning (ML) defenses against the many cyberattacks that make our workplaces, schools, private residences, and critical infrastructures vulnerable as a consequence of the dramatic increase in botnets, data ransom, system and network denials of service, sabotage, and data theft attacks. The use of ML techniques for security tasks has been steadily increasing in both research and practice over the last 10 years. Covering efforts to devise more effective defenses, the book explores security solutions that leverage ML techniques that have recently grown in feasibility thanks to significant advances in ML combined with big data collection and analysis capabilities. Since the use of ML entails understanding which techniques can be best used for specific tasks to ensure comprehensive security, the book provides an overview of the current state of the art of ML techniques for security and a detailed taxonomy of security tasks and corresponding ML techniques that can be used for each task. It also covers challenges for the use of ML for security tasks and outlines research directions. While many recent papers have proposed approaches for specific tasks, such as software security analysis and anomaly detection, these approaches differ in many aspects, such as the types of features in the model and the datasets used for training the models. In a way that no other available work does, this book provides readers with a comprehensive view of the complex area of ML for security, explains its challenges, and highlights areas for future research. This book is relevant to graduate students in computer science and engineering as well as information systems studies, and will also be useful to researchers and practitioners who work in the area of ML techniques for security tasks.
- Maximal Domain Independent Representations Improve Transfer Learning. Adrian Shuai Li, Elisa Bertino, Xuan-Hong Dang, Ankush Singla, and 2 more authors. arXiv preprint arXiv:2306.00262, 2023.
The most effective domain adaptation (DA) involves the decomposition of the data representation into a domain independent representation (DIRep) and a domain dependent representation (DDRep). A classifier is trained by using the DIRep of the labeled source images. Since the DIRep is domain invariant, the classifier can be "transferred" to make predictions for the target domain with no (or few) labels. However, information useful for classification in the target domain can "hide" in the DDRep in current DA algorithms such as Domain-Separation-Networks (DSN). DSN’s weak constraint enforcing orthogonality of the DIRep and DDRep allows this hiding and can result in poor performance. To address this shortcoming, we developed a new algorithm that imposes a stronger constraint, minimizing the DDRep with a KL divergence loss, in order to create the maximal DIRep that enhances transfer learning performance. By using synthetic data sets, we show explicitly that, depending on initialization, DSN with its weaker constraint can lead to sub-optimal solutions with poorer DA performance, whereas our algorithm with the maximal DIRep is robust against such perturbations. We demonstrate the equal-or-better performance of our approach against state-of-the-art algorithms by using several standard benchmark image datasets including Office. We further highlight the compatibility of our algorithm with pretrained models, extending its applicability and versatility in real-world scenarios.
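One plausible reading of "minimizing the DDRep with a KL divergence loss" is to model the DDRep as a diagonal Gaussian and penalize its KL divergence from a standard normal, as done in variational autoencoders, so that the DDRep is driven toward carrying no information. The sketch below shows that reading; it may differ from the paper's exact loss.

```python
# Sketch: KL penalty that shrinks a Gaussian-parameterized DDRep toward N(0, I).
import torch

def ddrep_kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    kl_per_sample = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)
    return kl_per_sample.mean()

# Hypothetical total objective under this reading:
# total_loss = classification_loss_on_DIRep + reconstruction_loss + beta * ddrep_kl_loss(mu, logvar)
```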
2022
- A capability-based distributed authorization system to enforce context-aware permission sequences. Adrian Shuai Li, Reihaneh Safavi-Naini, and Philip WL Fong. In Proceedings of the 27th ACM on Symposium on Access Control Models and Technologies, 2022.
Controlled sharing is fundamental to distributed systems. We consider a capability-based distributed authorization system where a client receives capabilities (access tokens) from an authorization server to access the resources of resource servers. Capability-based authorization systems have been widely used on the Web, in mobile applications, and in other distributed systems. A common requirement of such systems is that a client use tokens from multiple servers in a particular order. A related requirement is that a token may be used only if certain environmental conditions hold. We introduce a secure capability-based system that supports "permission sequence" and "context". This allows a finite sequence of permissions to be enforced, each with its own specific context. We prove the safety property of this system for these conditions and integrate the system into OAuth 2.0 with proof-of-possession tokens. We evaluate our implementation and compare it with plain OAuth with respect to the average time for obtaining an authorization token and acquiring access to the resource.
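To make the "permission sequence with context" idea concrete, here is a minimal sketch of a capability token that is consumed one permission at a time, each permission gated by its own context condition. The field names and the equality-based context check are hypothetical illustrations, not the paper's token format or its OAuth 2.0 proof-of-possession integration.

```python
# Sketch of a capability token carrying an ordered permission sequence,
# where each permission is bound to its own context condition.
from dataclasses import dataclass

@dataclass
class Permission:
    resource_server: str
    action: str
    context: dict            # e.g. {"location": "home", "network": "lan"} (hypothetical keys)

@dataclass
class CapabilityToken:
    subject: str
    sequence: list           # Permission objects, to be exercised in order
    next_index: int = 0      # position of the next permission allowed

    def authorize(self, server: str, action: str, env: dict) -> bool:
        """Allow a request only if it matches the next permission in the
        sequence and the environmental context holds."""
        if self.next_index >= len(self.sequence):
            return False
        p = self.sequence[self.next_index]
        if (p.resource_server, p.action) != (server, action):
            return False
        if not all(env.get(k) == v for k, v in p.context.items()):
            return False
        self.next_index += 1  # advance: permissions are consumed in order
        return True
```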
2020
- Secure logging with security against adaptive crash attack. Sepideh Avizheh, Reihaneh Safavi-Naini, and Shuai Li. In Foundations and Practice of Security: 12th International Symposium, FPS 2019, Toulouse, France, November 5–7, 2019, Revised Selected Papers 12, 2020.
Logging systems are an essential component of security systems and their security has been widely studied. Recently (2017) it was shown that existing secure logging protocols are vulnerable to a crash attack, in which the adversary modifies the log file and then crashes the system to make it indistinguishable from a normal system crash. The attacker was assumed to be non-adaptive and unable to see the file content before modifying it and crashing the system, which happens immediately after the modification. The authors also proposed a system called SLiC that protects against this attacker. In this paper, we consider an (insider) adaptive adversary who can see the file content as new log operations are performed. This is a powerful adversary who can attempt to rewind the system to a past state. We formalize security against this adversary and introduce a scheme with provable security. We show that security against this attacker requires some (small) protected memory that can become accessible to the attacker after the system compromise. We show that existing secure logging schemes are insecure in this setting, even if the system provides some protected memory as above. We propose a novel mechanism that, in its basic form, uses a pair of keys that evolve at different rates, and employ this mechanism in an existing logging scheme that has forward integrity to obtain a system with provable security against adaptive (and hence non-adaptive) crash attacks. We implemented our scheme on a desktop computer and a Raspberry Pi, and showed that, in addition to higher security, it achieves a significant efficiency gain over SLiC.
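The core two-rate key evolution can be illustrated roughly as follows: both keys evolve by a one-way hash, one after every log entry and one after every epoch of entries, and each entry is authenticated under both. The epoch length, hash choices, and MAC construction below are assumptions; the paper's full scheme and its use of protected memory are not reproduced.

```python
# Sketch of two keys evolving at different rates, with each log entry
# authenticated under both keys (forward integrity: old keys are erased).
import hashlib
import hmac

def evolve(key: bytes) -> bytes:
    """One-way key update; the previous key cannot be recovered from the new one."""
    return hashlib.sha256(b"evolve" + key).digest()

class TwoRateLogger:
    def __init__(self, fast_key: bytes, slow_key: bytes, epoch_len: int = 16):
        self.fast, self.slow = fast_key, slow_key
        self.count, self.epoch_len = 0, epoch_len
        self.entries = []

    def append(self, message: bytes):
        tag = hmac.new(self.fast + self.slow, message, hashlib.sha256).digest()
        self.entries.append((message, tag))
        self.fast = evolve(self.fast)          # evolves after every entry
        self.count += 1
        if self.count % self.epoch_len == 0:
            self.slow = evolve(self.slow)      # evolves once per epoch
```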
2018
- Towards a resilient smart home. Tam Thanh Doan, Reihaneh Safavi-Naini, Shuai Li, Sepideh Avizheh, and 2 more authors. In Proceedings of the 2018 Workshop on IoT Security and Privacy, 2018.
Best Paper Award
Today’s Smart Home platforms such as Samsung SmartThings and Amazon AWS IoT are primarily cloud based: devices in the home sense the environment and send the collected data, directly or through a hub, to the cloud. The cloud runs various applications and analytics on the collected data, and generates, according to the users’ specifications, commands that are sent to the actuators to control the environment. The role of the hub in this setup is effectively message passing between the devices and the cloud, while the required analytics, computation, and control are all performed by the cloud. We ask the following question: what if the cloud is not available? This can happen not only by accident or natural causes, but also due to targeted attacks. We discuss possible effects of such unavailability on the functionalities that are commonly available in smart homes, including security and safety related services as well as support for the health and well-being of home users, and propose RES-Hub, a hub that can provide the required functionalities when the cloud is unavailable. During the normal functioning of the system, RES-Hub receives regular status updates from the cloud and uses this information to continue providing the user-specified services when it detects that the cloud is down. We describe an IoTivity-based software architecture that is used to implement RES-Hub in a flexible and extendable way and discuss our implementation.
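A rough sketch of the failover behavior described above: the hub timestamps periodic cloud status updates and, when they stop arriving, evaluates locally cached user rules instead of forwarding events to the cloud. All names, the timeout, and the rule representation are hypothetical; the actual RES-Hub is implemented on IoTivity and is not reproduced here.

```python
# Sketch of cloud-unavailability detection and local fallback in a resilient hub.
import time

class ResilientHub:
    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_update = time.monotonic()
        self.cached_rules = []            # list of (condition, action) pairs mirrored from the cloud

    def on_cloud_status(self, rules):
        """Called whenever a regular status update arrives from the cloud."""
        self.last_update = time.monotonic()
        self.cached_rules = rules

    def cloud_available(self) -> bool:
        return time.monotonic() - self.last_update < self.timeout_s

    def handle_sensor_event(self, event: dict):
        if self.cloud_available():
            return {"route": "cloud", "event": event}   # normal path: the cloud decides and actuates
        # Cloud is down: evaluate cached rules locally and return actions to actuate directly.
        return [action for condition, action in self.cached_rules if condition(event)]
```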