Stacked Modelling Framework

Monday, October 21, 2019 - 3:30pm to 4:30pm
Meeting Room 2265, Innovation Center

DISSERTATION DEFENSE
Department of Computer Science and Engineering
University of South Carolina

Author : Kareem Abdelfatah
Advisor : Dr. Gabriel Terejanu
Date : Oct 21, 2019
Time : 3:30 pm
Place : Meeting Room 2265, Innovation Center

Abstract

The thesis develops a predictive modeling framework based on stacked Gaussian processes and applies it to two main applications in environmental and chemical engineering. First, a network of independently trained Gaussian processes (StackedGP) is introduced to obtain analytical predictions of quantities of interest (model outputs) with quantified uncertainties. The StackedGP framework supports component-based modeling in fields such as environmental and chemical science, enhances predictions of quantities of interest through a cascade of intermediate predictions (a task usually addressed by cokriging), and propagates uncertainties through emulated dynamical systems driven by uncertain forcing variables. Using the analytical first- and second-order moments of a Gaussian process with uncertain inputs, which are available for squared exponential and polynomial kernels, approximate expectations of model outputs that require an arbitrary composition of functions can be obtained. The performance of the proposed nonparametric stacked model in model composition and cascading predictions is measured on several applications and datasets. The framework has been evaluated on a wildfire problem and a mineral resource problem using real data, and its application to time-series prediction is demonstrated on a 2D puff advection problem.
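To illustrate the idea of propagating an uncertain intermediate prediction through a second Gaussian process, here is a minimal one-dimensional sketch. It is not the thesis implementation: the toy data, the fixed hyperparameters, and the helper names (`fit_gp`, `predict`, `expected_mean_uncertain_input`) are assumptions made for illustration. It uses the standard closed-form expectation of a squared exponential GP mean under a Gaussian-distributed input and compares it with a Monte Carlo estimate.

```python
import numpy as np

def se_kernel(a, b, ell, sf2):
    """Squared exponential kernel between 1D arrays a and b."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def fit_gp(x, y, ell=1.0, sf2=1.0, noise=1e-2):
    """Fit a GP with fixed (assumed) hyperparameters; no optimization."""
    K = se_kernel(x, x, ell, sf2) + noise * np.eye(len(x))
    return {"x": x, "K": K, "beta": np.linalg.solve(K, y), "ell": ell, "sf2": sf2}

def predict(gp, xs):
    """Posterior mean and latent variance at deterministic inputs xs."""
    ks = se_kernel(xs, gp["x"], gp["ell"], gp["sf2"])
    mean = ks @ gp["beta"]
    var = gp["sf2"] - np.einsum("ij,ji->i", ks, np.linalg.solve(gp["K"], ks.T))
    return mean, np.maximum(var, 0.0)

def expected_mean_uncertain_input(gp, mu, s2):
    """E[m(z)] for z ~ N(mu, s2), in closed form for the SE kernel."""
    ell2 = gp["ell"] ** 2
    q = gp["sf2"] / np.sqrt(1.0 + s2 / ell2) * np.exp(
        -0.5 * (mu - gp["x"]) ** 2 / (ell2 + s2)
    )
    return q @ gp["beta"]

rng = np.random.default_rng(0)

# Layer 1 (toy data): x -> z. Layer 2 (toy data): z -> y. Trained independently.
x = np.linspace(-3.0, 3.0, 25)
z = np.sin(x) + 0.05 * rng.standard_normal(x.size)
z_train = np.linspace(-1.5, 1.5, 25)
y = z_train**2 + 0.05 * rng.standard_normal(z_train.size)
gp1, gp2 = fit_gp(x, z), fit_gp(z_train, y)

# Cascade: the uncertain output of layer 1 becomes the input of layer 2.
mu_z, var_z = predict(gp1, np.array([0.7]))
analytic = expected_mean_uncertain_input(gp2, mu_z[0], var_z[0])

# Monte Carlo check of the same expectation.
samples = rng.normal(mu_z[0], np.sqrt(var_z[0]), 100_000)
monte_carlo = predict(gp2, samples)[0].mean()
print(analytic, monte_carlo)
```

The sketch covers only the first moment of a two-node chain; the predictive variance and the polynomial kernel case mentioned in the abstract need additional closed-form terms that are not shown here.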

In addition, the StackedGP is applied to a challenging environmental problem: the prediction of mycotoxins. In this part of the work, we develop a stacked Gaussian process using both field and wet-lab measurements to predict fungal toxin (aflatoxin) concentrations in corn in South Carolina. While most of the aflatoxin contamination issues associated with the post-harvest period in the U.S. can be controlled with expensive testing, a systematic and economical approach is lacking to determine how the pre-harvest aflatoxin risk adversely affects crop producers, as aflatoxin is virtually unobservable on a geographical and temporal scale. This information gap carries significant cost burdens for grain producers, and the proposed stacked Gaussian process fills it. The novelty of this part is twofold. First, the aflatoxin probabilistic maps are obtained using an analytical scheme to propagate the uncertainty through the stacked Gaussian process. The model predictions are validated both at the Gaussian process component level and at the system level for the entire stacked Gaussian process using historical field data. Second, a novel derivation is introduced to calculate the analytical covariance of aflatoxin production at two geographical locations. As in kriging and Gaussian process regression, this covariance is used to predict aflatoxin at unobserved locations from measurements at nearby locations, but with the prior mean and covariance provided by the stacked Gaussian process. As field measurements arrive, this measurement update scheme may be used to target field inspections and to warn farmers of emerging aflatoxin contamination.
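The measurement update described above follows the usual Gaussian conditioning step used in kriging. The sketch below shows only that generic step; the function name `measurement_update` and the three-location prior mean and covariance are hypothetical values chosen for illustration, and they stand in for the quantities that the stacked Gaussian process would supply.

```python
import numpy as np

def measurement_update(prior_mean, prior_cov, obs_idx, obs_values, noise_var=0.0):
    """Condition a joint Gaussian prior over locations on observed locations."""
    prior_mean = np.asarray(prior_mean, dtype=float)
    prior_cov = np.asarray(prior_cov, dtype=float)
    obs_values = np.asarray(obs_values, dtype=float)
    # Covariance among observed locations, plus measurement noise.
    S = prior_cov[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(len(obs_idx))
    # Cross-covariance between all locations and the observed ones.
    C = prior_cov[:, obs_idx]
    gain = np.linalg.solve(S, C.T).T  # equals C @ inv(S) because S is symmetric
    post_mean = prior_mean + gain @ (obs_values - prior_mean[obs_idx])
    post_cov = prior_cov - gain @ C.T
    return post_mean, post_cov

# Hypothetical prior over three locations (illustrative numbers only).
prior_mean = np.array([1.0, 1.2, 0.8])
prior_cov = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

# A noise-free measurement of 2.0 arrives at location 0.
post_mean, post_cov = measurement_update(prior_mean, prior_cov, [0], [2.0])
print(post_mean)           # updated means at all three locations
print(np.diag(post_cov))   # updated variances
```

In this toy setup, the strongly correlated location 1 moves most of the way toward the new measurement and its variance shrinks, while the weakly correlated location 2 changes only slightly.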

Lastly, we apply the StackedGP framework to a chemical engineering application. Computational catalyst discovery involves identifying a meaningful model and suitable descriptors that determine the catalyst properties. First, we study the impact of combining various descriptors (e.g., reaction energies, metal descriptors, and bond counts) for modeling transition state (TS) energies based on a database of adsorption and TS energies across transition metal surfaces {Palladium (PD_111), Platinum (PT_111), Nickel (NI_111), Ruthenium (RU_0001), and Rhodium (RH_111)} for the decarboxylation and decarbonylation of propionic acid, a chemistry characteristic of biomass conversion. Results of different machine learning models for more than 1330 of these descriptor combinations suggest that there is no statistically significant difference between linear and nonlinear models when using the right combination of reactant energies, metal descriptors, and bond counts. However, linear models are inferior when bond counts and metal descriptors are not included. Furthermore, when data are missing for reaction steps on all metals, conventional linear scaling is inferior to linear and nonlinear models with a proper choice of descriptors, which are surprisingly robust. Finally, the StackedGP framework is evaluated in modeling the adsorption and transition state energies as a function of metal descriptors with data from all metal surfaces. With these energies, the turnover frequency (TOF) can be estimated using microkinetic models.