Monday, June 22, 2026 - 02:00 pm
online

DISSERTATION DEFENSE
 

Author: Ahmed Shehab Khan
Advisor: Dr. Yan Tong
Date: June 22, 2026
Time: 02:00 pm
Place: Virtual (Zoom)
Link: https://sc-edu.zoom.us/j/86147086296

Abstract
Group Emotion Recognition (GER) is the task of inferring the collective emotional state of a group of individuals from a single image. The task has several inherent challenges. First, relevant evidence is spread across faces, body language, objects, and the surrounding scene, and no single cue is sufficient. Second, individuals and regions within a group contribute unequally to the perceived emotion, with the relative importance shifting from image to image. Third, existing methods that integrate these signals rely on multi-stream pipelines with separate detectors and networks, making inference computationally expensive. This dissertation develops three deep learning frameworks for GER, each building on the previous to address these challenges.


First, we propose a four-stream hybrid network that combines features from individual faces, the scene, and the spatial arrangement of faces within the image. Its streams include a face-location aware stream, which captures the relationship between faces and scene through an attention heatmap; a multi-scale face stream, which handles the high variance in face size found in images collected in the wild; and a global blurred stream, which learns scene-only features by suppressing face appearance. The four streams are combined with hand-engineered fusion weights.

Second, we propose Regional Attention Networks with Context-aware Fusion. Building on the multi-stream approach, this work addresses two open questions: how to determine the importance of individual persons and objects within a group, and how the relative weight of each stream should depend on the image. A regional attention mechanism estimates the importance of each person or object from the image, and a context-aware fusion module replaces the fixed stream weights of the prior framework with values derived from the image content itself. To reduce computational cost, feature extraction is consolidated onto a single shared backbone.
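
The difference between fixed and image-dependent stream weights can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the use of a softmax over per-stream gate scores, and the fusion of class probabilities (rather than features) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(
    stream_probs: Sequence[Sequence[float]],
    gate_scores: Sequence[float],
) -> List[float]:
    """Weighted average of per-stream class probabilities.

    stream_probs: one probability vector per stream (hypothetical inputs).
    gate_scores:  one score per stream. In a context-aware setting these
                  would be predicted from the image; with fixed fusion
                  they would be constants shared by every image.
    """
    if len(stream_probs) != len(gate_scores):
        raise ValueError("need exactly one gate score per stream")
    weights = softmax(gate_scores)
    n_classes = len(stream_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, stream_probs))
        for c in range(n_classes)
    ]
```

Under these assumptions, equal gate scores reduce to a plain average of the streams, while a large score for one stream lets that stream dominate the prediction for that particular image.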

Third, we propose LG-GER, a language-guided distillation framework that addresses the inference cost of detector-driven multi-stream pipelines. To overcome the lack of spatial supervision in existing GER datasets, a multimodal large language model (MLLM) serves as an offline annotator that generates dense, spatially grounded emotion evidence (bounding boxes with emotion signals and confidence scores) for the training images. This structured evidence is distilled into a single vision-language backbone through four complementary losses: classification, region-text grounding, spatial emotion, and spatial confidence regression. At inference, the framework requires no detectors, no MLLMs, and no multi-stream fusion, making it, to our knowledge, the first detector-free framework for group emotion recognition.
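
A training objective built from four loss terms can be sketched as a weighted sum. The abstract does not say how the terms are combined or how each is computed, so the weighted sum, the `LossWeights` fields and their default values, and the helper functions below are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical coefficients; the abstract does not give any values.
    classification: float = 1.0
    grounding: float = 1.0
    spatial_emotion: float = 1.0
    spatial_confidence: float = 1.0


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class (assumed classification loss)."""
    return -math.log(max(probs[target], 1e-12))


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed form of a regression loss, e.g. for confidence scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(
    classification: float,
    grounding: float,
    spatial_emotion: float,
    spatial_confidence: float,
    weights: LossWeights = LossWeights(),
) -> float:
    """Combine four already-computed scalar loss terms into one objective."""
    return (
        weights.classification * classification
        + weights.grounding * grounding
        + weights.spatial_emotion * spatial_emotion
        + weights.spatial_confidence * spatial_confidence
    )
```

In this sketch the annotations enter only through the loss terms at training time, which is consistent with the abstract's point that nothing but the single backbone is needed at inference.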

All three frameworks were evaluated on publicly available GER benchmarks. Visualizations and case studies illustrate how the attention and fusion components identify the most informative regions for group emotion recognition.