Department of Computer Science and Engineering
University of South Carolina
Author : Hatim Alsuwat
Advisor : Dr. Csilla Farkas and Dr. Marco Valtorta
Date : Oct 7, 2019
Time : 3:30 pm
Place : Meeting Room 2267, Innovation Center
In this dissertation, we investigate and address two kinds of data integrity threats. We first study the limitations of secure cryptographic shuffling algorithms regarding preservation of data dependencies. We then study the limitations of machine learning models regarding concept drift detection. We propose solutions to address these threats.
Shuffling Algorithms have been used to protect the confidentiality of sensitive data. However, these algorithms may not preserve data dependencies, such as functional dependencies and data-driven associations. We present two solutions for addressing these shortcomings: (1) Functional dependencies preserving shuffle, and (2) Data-driven associations preserving shuffle. For preserving functional dependencies, we propose a method using Boyce-Codd Normal Form (BCNF) decomposition. Instead of shuffling the original relation, we recommend to shuffle each BCNF decomposition. The final shuffled relation is constructed by joining the shuffled decompositions. We show that our approach is lossless and preserves functional dependencies if the BCNF decomposition is dependency preserving. For preserving data-driven associations, we generate the transitive closure of the sets of attributes that are associated. Attributes of each set are bundled together during shuffling.
Concept drift is a significant challenge that greatly influences the accuracy and reliability of machine learning models. There is, therefore, a need to detect concept drift in order to ensure the validity of learned models. We study the issue of concept drift in the context of discrete Bayesian networks. We propose a probabilistic graphical model framework to explicitly detect the presence of concept drift using latent variables. We employ latent variables to model real concept drift and uncertainty drift over time. For modeling real concept drift, we propose to monitor the mean of the distribution of the latent variable over time. For modeling uncertainty drift, we suggest to monitor the change in belief of the latent variable over time, i.e., we monitor the maximum value that the probability density function of the distribution takes over time. We also propose a probabilistic graphical model framework that is based on using latent variables to provide an explanation of the detected posterior probability drift across time.
Our results show that neither cryptographic shuffling algorithms nor machine learning models are robust against data integrity threats. However, our proposed approaches are capable of detecting and mitigating such threats.