Jan. 13, 2021
For this year's final project in "Intro to Graduate ML", Daniel Jeong, Anush Devadhasan, and I were tasked with the facial expression recognition problem (check out our in-depth presentation here: https://youtu.be/mPtWvemach8). The objective is fairly simple: given a 48x48 greyscale image of a face, predict the emotion on that face. Our brains automatically account for a huge amount of variation between faces and features, which makes this task deceptively difficult for machines.
Originally, we used the FER2013 dataset, but we quickly discovered that it was riddled with non-face images and incorrect labels (which seems to indicate that the dataset was gathered and labeled by a machine, ironically). But even if everything were properly curated and labeled, the problem definition is flawed: people aren't usually displaying just one emotion. Microsoft recognized this in 2016 and created the FER+ dataset, a probabilistic relabeling of the original FER2013 dataset in which 10 crowdsourced annotators gauged the expression in each image. The startling result is that people seem to interpret emotion differently as well: most images have their votes split between multiple categories (with a large bias toward Neutral labels).
The shortcomings of both humans and machines at this task spurred us to find a more machine-interpretable mode of emotion learning. Specifically, we wanted a continuous, low-dimensional space in which we could represent and interpolate emotive facial expressions. Enter the discriminator-VAE, a VAE with a 2-pronged decoder: one head decodes the latent space into an image reconstruction, and the other attempts to recover the softmax FER+ prediction from the same encoding. The resulting encodings were conducive to clustering and allowed image reconstruction, which we used to project multiple different emotions onto a single face.
However, while some clusters showed clear emotions, others mostly captured face orientation, age, and sex. Overall, this was a step above other methods for representing human emotion, but the small, unbalanced dataset prevented us from exploring further before the end of our semester.
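To make the two-headed idea concrete, here is a minimal NumPy sketch of one forward pass and its loss. It is not our project code. The linear encoder and decoders, the 2-D latent space, the 8 emotion classes, the squared-error reconstruction term, the `label_weight` knob, and every function name are assumptions chosen for illustration. It only shows how crowdsourced votes can become soft labels and how a reconstruction term, a KL term, and a soft-label cross-entropy term can be summed into one objective.

```python
import numpy as np

rng = np.random.default_rng(0)


def votes_to_soft_labels(votes):
    """Turn per-image annotator vote counts into probability distributions."""
    votes = np.asarray(votes, dtype=np.float64)
    return votes / votes.sum(axis=1, keepdims=True)


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def init_params(n_pixels=48 * 48, n_latent=2, n_classes=8):
    """Small random linear layers (hypothetical sizes, no training here)."""
    def w(rows, cols):
        return rng.normal(0.0, 0.01, size=(rows, cols))
    return {
        "enc_mu": w(n_pixels, n_latent),
        "enc_logvar": w(n_pixels, n_latent),
        "dec_image": w(n_latent, n_pixels),   # head 1: reconstruction
        "dec_label": w(n_latent, n_classes),  # head 2: emotion distribution
    }


def forward_loss(params, images, soft_labels, label_weight=1.0):
    """One forward pass; returns the total loss and its individual terms."""
    # Encoder: mean and log-variance of the approximate posterior.
    mu = images @ params["enc_mu"]
    logvar = images @ params["enc_logvar"]

    # Reparameterization trick: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps

    # Head 1: sigmoid output compared to pixels scaled to [0, 1].
    recon = 1.0 / (1.0 + np.exp(-(z @ params["dec_image"])))
    recon_loss = np.mean(np.sum((recon - images) ** 2, axis=1))

    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))

    # Head 2: cross-entropy against the soft (probabilistic) labels.
    log_probs = log_softmax(z @ params["dec_label"])
    label_loss = np.mean(-np.sum(soft_labels * log_probs, axis=1))

    total = recon_loss + kl + label_weight * label_loss
    return total, {"recon": recon_loss, "kl": kl, "label": label_loss, "z": z}


# Two made-up images with 10 votes each, split across 8 classes.
example_votes = [[6, 4, 0, 0, 0, 0, 0, 0],
                 [0, 7, 3, 0, 0, 0, 0, 0]]
example_labels = votes_to_soft_labels(example_votes)
example_images = rng.random((2, 48 * 48))  # stand-in for real face crops
example_total, example_terms = forward_loss(init_params(), example_images, example_labels)
```

Because both heads read the same `z`, the label term pushes images with similar vote distributions toward each other in latent space, while the reconstruction term keeps enough information to redraw the face. That shared bottleneck is also why non-emotion factors such as orientation or age can end up organizing parts of the space.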