Data Analysis For Insider’s Misuse Detection

Tuesday, November 12, 2019 - 3:00pm to 4:00pm
Meeting Room 2267, Innovation Center

Department of Computer Science and Engineering
University of South Carolina

Author : Ahmed Saaudi
Advisor : Dr. Farkas
Date : Nov 12th, 2019
Time : 3:00 pm
Place : Meeting Room 2267, Innovation Center


Malicious insiders increasingly harm organizations by leaking classified data to unauthorized entities. Detecting insiders’ misuse of computer systems is a challenging problem. In this dissertation, we propose two approaches to detect such threats: a probabilistic graphical model-based approach and a deep learning-based approach. We investigate logs of computer-based activities to discover patterns of misuse, and we model users’ behaviors as sequences of computer-based events.

For our probabilistic graphical model-based approach, we propose an unsupervised model for insider’s misuse detection. Specifically, we develop a Stochastic Gradient Descent method to learn Hidden Markov Models (SGD-HMM) with the goal of analyzing user log data. We propose representing users’ log data at three granularity levels: Session-based, Day-based, and Week-based. A user’s normal behavior is modeled using SGD-HMM, and the model is then used to detect any deviation from that normal behavior. We also propose a Sliding Window Technique (SWT) that identifies malicious activity by considering the recent history of the user’s activities. We evaluate the experimental results using Receiver Operating Characteristic (ROC) curves. The area under the curve (AUC) measures how well the model separates the classes: the higher the AUC score, the better the model’s performance. Combining SGD-HMM with SWT resulted in AUC values between 0.81 and 0.9, depending on the window size. Based on the achieved AUC scores, our solution outperforms current solutions.
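The scoring pipeline described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes the HMM parameters have already been learned (the SGD training step is not shown), it assumes the sliding window simply averages the most recent scores, and all function names are hypothetical.

```python
import math
from collections import deque


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs) under a discrete HMM.

    Assumes every observation has nonzero probability under the model.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate state beliefs one step, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        norm = sum(alpha)  # P(obs[t] | obs[:t])
        total += math.log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return total


def windowed_scores(scores, window):
    """Average each score with up to `window - 1` preceding scores.

    Assumption: the window aggregates by a plain mean.
    """
    recent = deque(maxlen=window)
    smoothed = []
    for s in scores:
        recent.append(s)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


def auc(labels, scores):
    """Chance that a random positive scores above a random negative.

    Ties count as one half; label 1 marks the positive (malicious) class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In such a setup, each session, day, or week would be scored by its negative log-likelihood under the user's model, the scores would be smoothed with `windowed_scores`, and `auc` would compare the smoothed scores against the ground-truth labels.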

For our deep learning-based approach, we propose a supervised model for insider’s misuse detection. We present our solution using natural language processing with deep learning. We examine textual event logs to investigate the semantic meaning behind a user’s behavior. The proposed approach consists of character embeddings and deep learning networks that involve Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). We develop three deep-learning models: CNN, LSTM, and CNN-LSTM. We run a 10-fold subject-independent cross-validation procedure to evaluate the developed models. Moreover, we use our proposed approach to investigate networks with deeper and wider structures. To this end, we study how model performance changes when we increase the number of CNN or LSTM layers, the number of nodes per layer, or both together. Our deep learning-based approach shows promising behavior. The first model, CNN, performs best at classifying normal samples, with an AUC score of 0.88, a false-negative rate of 32%, and a false-positive rate of 8%. The second model, LSTM, performs best at detecting malicious samples, with an AUC score of 0.873, a false-negative rate of 0%, and a false-positive rate of 37%. The third model, CNN-LSTM, shows balanced behavior in detecting both normal and insider samples, with an AUC score of 0.862, a false-negative rate of 16%, and a false-positive rate of 17%.
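Two parts of the evaluation described above can be sketched in a few lines: a subject-independent split, in which no user contributes samples to both the training and the test side of a fold, and the false-negative and false-positive rates. This is an illustrative sketch only. The round-robin assignment of users to folds and the function names are assumptions, not the procedure used in the dissertation.

```python
def subject_independent_folds(user_ids, k):
    """Yield (train_idx, test_idx) pairs with no user on both sides.

    Assumption: users are assigned to folds round-robin in sorted order.
    """
    users = sorted(set(user_ids))
    for fold in range(k):
        held_out = set(users[fold::k])
        test = [i for i, u in enumerate(user_ids) if u in held_out]
        train = [i for i, u in enumerate(user_ids) if u not in held_out]
        yield train, test


def error_rates(labels, predictions):
    """Return (false_negative_rate, false_positive_rate).

    Label 1 marks a malicious sample and 0 a normal one. Both classes
    must be present in `labels`.
    """
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return fn / positives, fp / negatives
```

Splitting by user rather than by sample keeps a model from being rewarded for memorizing an individual's habits, which is why the rates measured this way reflect performance on unseen users.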

Our results indicate that machine learning approaches can be effectively deployed to detect insiders’ misuse. However, it is difficult to obtain labeled data. Furthermore, the high presence of normal behavior and limited misuse activities create a highly unbalanced data set. This impacts the performance of our models.