Department of Computer Science and Engineering, University of South Carolina
Candidate: Shizhong Han
Advisor: Dr. Yan Tong
Recognizing facial action units (AUs) from spontaneous facial expression is a challenging problem, because of subtle facial appearance changes, free head movements, occlusions, and limited AU-coded training data. Most recently, convolutional neural networks (CNNs) have shown promise on facial AU recognition. However, CNNs are often overfitted and do not generalize well to unseen subject due to limited AU-coded training images. In order to improve the performance of facial AU recognition, we developed two novel CNN frameworks, by substituting the traditional decision layer and convolutional layer with the incremental boosting layer and adaptive convolutional layer respectively, to recognize the AUs from static image.
First, in order to handle the limited AU-coded training data and reduce the overfitting, we proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on four benchmark AU databases have demonstrated that the IB-CNN yields significant improvement over the traditional CNN and the boosting CNN without incremental learning, as well as outperforming the state-of-the-art CNN-based methods in AU recognition. The improvement is more impressive for the AUs that have the lowest frequencies in the databases.
Second, all current CNNs use predefined and fixed convolutional filter size. However, AUs activated by different facial muscles cause facial appearance changes at different scales and thus favor different filter sizes. The traditional strategy is to experimentally select the best filter size for each AU in each convolutional layer, but it suffers from expensive training cost, especially when the networks become deeper and deeper. We proposed a novel Optimized Filter Size CNN (OFS-CNN), where the filter sizes and weights of all convolutional layers are learned simultaneously from the training data along with learning convolutional filters. Specifically, the filter size is defined as a continuous variable, which is optimized by minimizing the training loss. Experimental results on four AU-coded databases and one spontaneous facial expression database outperforms traditional CNNs with fixed filter sizes and achieves state-of-the-art recognition performance. Furthermore, the OFS-CNN also beats traditional CNNs using the best filter size obtained by exhaustive search and is capable of estimating optimal filter size for varying image resolution.