Representation learning through cross-modality supervision

Abstract

Learning robust representations for applications with multiple input modalities can have a significant impact on their performance. Traditional representation learning methods project the input modalities onto a common subspace to maximize agreement among the modalities for a particular task. We propose a novel approach to representation learning that uses a latent representation decoder to reconstruct the target modality, thereby employing the target modality purely as a supervision signal for discovering correlations between the modalities. Through cross-modality supervision, we demonstrate that the learnt representation improves performance on facial action unit (AU) recognition compared with modality-specific representations and even their fused counterparts. Our experiments on three AU recognition datasets, MMSE, BP4D and DISFA, show strong performance gains, producing state-of-the-art results in spite of the absence of a modality.
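The sketch below is a minimal illustration of the cross-modality supervision idea described in the abstract: one modality is encoded into a latent representation, a decoder reconstructs the other (target) modality from that latent code so it acts purely as a training-time supervision signal, and an AU classifier operates on the shared latent space. The module names, feature dimensions, and simple MLP architectures are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of cross-modality supervision (not the authors' code).
import torch
import torch.nn as nn

class CrossModalitySupervisedModel(nn.Module):
    def __init__(self, in_dim=512, latent_dim=128, target_dim=256, num_aus=12):
        super().__init__()
        # Encoder maps the available input modality to a latent representation.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder reconstructs the *target* modality from the latent code,
        # forcing the latent space to capture cross-modal correlations.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, target_dim))
        # AU recognition head operates on the shared latent representation.
        self.classifier = nn.Linear(latent_dim, num_aus)

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)

model = CrossModalitySupervisedModel()
bce = nn.BCEWithLogitsLoss()   # multi-label AU recognition loss
mse = nn.MSELoss()             # reconstruction loss against the target modality
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step: the target modality appears only in the loss, never as input.
x_input = torch.randn(8, 512)        # input modality features (assumed visual)
x_target = torch.randn(8, 256)       # supervising modality features (assumed, e.g. thermal)
au_labels = torch.randint(0, 2, (8, 12)).float()

au_logits, target_recon = model(x_input)
loss = bce(au_logits, au_labels) + mse(target_recon, x_target)
opt.zero_grad(); loss.backward(); opt.step()

# At inference the second modality is absent: only the encoder and classifier run.
```

Because the decoder is used only to compute the reconstruction loss, the second modality can be dropped at test time, which matches the abstract's claim of strong results "in spite of the absence of a modality".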

Publication
In IEEE International Conference on Automatic Face & Gesture Recognition, 2019
Deen Dayal Mohan
PhD student, Department of Computer Science

My research interests include computer vision, multimodal representation learning and biometrics.