Pokerface: The Word-Emotion Detector
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2016, Issue 1, Mar 2016, ICTSP2894
Abstract
Every day, humans interact with text from many sources, such as news, literature, education, and even social media. While reading, humans process text word by word, accessing the meaning of each word from the lexicon and, when needed, adjusting that meaning to match the context of the text (Harley, 2014). The process of reading can induce a range of emotions, such as engagement, confusion, frustration, surprise or happiness. For example, unfamiliar jargon may confuse readers as they try to understand the text.
In the past, scientists have addressed emotion in text from the writer's perspective. For example, the field of Sentiment Analysis aims to detect the emotional charge of words in order to infer the intentions of the writer. Here we propose the reverse approach: detecting the emotions induced in readers while they process text.
Detecting which emotions are induced by reading a piece of text can give us insights into the nature of the text itself. A word-emotion detector can be used to assign specific emotions experienced by readers to specific words or passages of text. To our knowledge, this area of research has not been explored before.
There are many potential applications of a word-emotion detector. For example, it can be used to analyze how passages in books, news or social media are perceived by readers, which can guide stylistic choices to cater to a particular audience. In a learning environment, it can be used to detect the affective states and emotions of students in order to infer their level of understanding, which in turn can be used to assist students with difficult passages. In a commercial environment, it can be used to detect reactions to wording in advertisements. In the remainder of this report, we detail the first steps we followed to build a word-emotion detector. We present the details of our system, developed during QCRI's 2015 Hot Summer Cool Research internship program, as well as the initial experiments. In particular, we describe our experimental setup, in which viewers watch a foreign-language video with modified subtitles containing deliberate emotion-inducing changes. We analyze the results and discuss future work.
The Pokerface System
A poker face is an inscrutable face that reveals no hint of a person's thoughts or feelings. The goal of the ‘Pokerface’ project is to build a word-emotion detector that works even if no facial movements are present. To do so, the Pokerface system combines three recent consumer-level technologies: eye-tracking to detect the words that are being read; electroencephalography (EEG) to detect the brain activity of the reader; and facial-expression recognition (FER) to detect movement in the reader's face. We then classify the detected brain activity and facial movements into emotions using Neural Networks.
In this report, we present the details of our Pokerface system, as well as the initial experiments done during QCRI's 2015 Hot Summer Cool Research internship program. In particular, we describe the setup in which viewers watch a foreign-language video with subtitles containing deliberate emotion-inducing changes.
Methodology
To detect the emotions experienced by readers as they read text, we combined several technologies. FER and EEG are used to detect emotional reactions through changes in facial expressions and brainwaves, while eye-tracking is used to identify the stimulus (text) that triggered the detected reaction. A video interface was created to run the experiments. Below we describe each component and how we used it in the project.
EEG
EEG is the recording of electrical activity along the scalp (Niedermeyer and Lopes da Silva, 2005). EEG measures voltage fluctuations resulting from ionic current flows within the neurons of the brain. It is one of the few non-intrusive techniques available that provide a window on physiological brain activity. EEG averages the response from many neurons as they communicate, measuring the electrical activity with surface electrodes. We can then use the brain activity of a user to detect their emotional state.
Data Gathering
In our experiments, we used the Emotiv EPOC EEG neuroheadset (2013), which has 14 EEG channels plus two references, inertial sensors, and two gyroscopes. The raw data from the neuroheadset was parsed together with the timestamp of each sample.
Data Cleaning and Artifact Removal
After retrieving the data from the EEG, we need to remove “artifacts”, which are changes in the signals that do not originate from neurons (Vidal, 1977), such as ocular movements, muscular movements, and technical noise. To do so, we used the open source toolbox EEGlab (Delorme & Makeig, 2004) to remove artifacts and to filter the signal (a 4–45 Hz band-pass, which also removes line noise).
ERP Collection
We decided to treat the remaining artifacts as random noise and move forward with extracting Event Related Potentials (ERPs), since all the other options we had found required some level of manual intervention. ERPs are the sections of our EEG data that are relevant with regard to the stimuli and the subjects' reaction time. To account for random effects from the artifacts, we averaged the ERPs over different users and events. To do so, we used EEGlab's plugin ERPlab to add event codes to our continuous EEG data based on stimulus time.
Events
Our events were defined as textual modifications in subtitles, designed to induce emotions of confusion, frustration or surprise. The time at which the subject looks at a word was marked as the stimulus time (st) for that word, and the reaction time was marked as st + 800 ms, because a reaction to a stimulus is rarely seen more than 800 ms after its appearance (Fischler and Bradley, 2006).
The ERPs were obtained as the average of different events corresponding to the same condition (control or experimental).
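The epoch extraction and averaging described above can be illustrated with a minimal sketch. This is not the EEGlab/ERPlab pipeline used in the project: it assumes a single channel stored as parallel lists of timestamps (in ms) and samples, treats the interval from st to st + 800 ms as the epoch, and uses hypothetical function names chosen for illustration.

```python
from statistics import fmean

WINDOW_MS = 800  # length of the post-stimulus window, as stated in the text


def extract_epoch(timestamps_ms, samples, stimulus_ms, window_ms=WINDOW_MS):
    """Return the samples recorded in [stimulus_ms, stimulus_ms + window_ms]."""
    return [
        value
        for t, value in zip(timestamps_ms, samples)
        if stimulus_ms <= t <= stimulus_ms + window_ms
    ]


def average_erp(epochs):
    """Average epochs sample by sample, truncating to the shortest epoch."""
    if not epochs:
        return []
    length = min(len(epoch) for epoch in epochs)
    return [fmean(epoch[i] for epoch in epochs) for i in range(length)]


# Toy single-channel recording: one sample every 100 ms (an assumption).
timestamps = list(range(0, 2000, 100))
channel = [float(i) for i in range(len(timestamps))]

# Two events of the same condition, with stimulus times 200 ms and 1000 ms.
epochs = [extract_epoch(timestamps, channel, st) for st in (200, 1000)]
erp = average_erp(epochs)
```

Averaging over many events of the same condition is what lets noise that is uncorrelated with the stimulus cancel out, which is the rationale given above for treating the remaining artifacts as random.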
Eye-Tracking
An eye-tracker is an instrument that detects the movements of the eye. It identifies where a user is looking by shining a light onto the eye and capturing the reflection with image sensors. The eye-tracker then measures the angle between the cornea and pupil reflections to calculate a vector and identify the direction of the gaze.
In this project, we used the EyeTribe eye-tracker to identify the words a reader looked at while reading. It was set up on a Windows machine. Before an experiment, the user needs to calibrate the eye-tracker, and recalibration is necessary every time the user changes their sitting position. JavaScript and Node.js were used to create a function that, while the eye-tracker is running, extracts and parses the data from the machine and writes it to a text file. This data includes the screen x and y coordinates of the gaze, the timestamp, and an indicator of whether the gaze point is a fixation or not. The data is received at a rate of 60 fps. The gaze points are used to determine which words are looked at at any specific time.
Video Interface
In our experiments, each user was presented with a video with subtitles. To create the experimental interface, we based our design choices on previous empirical research. We used the Helvetica font, given its consistency across all platforms (Falconer, 2011), and a font size of 26, given that it improves the readability of subtitles on large desktops (Franz, 2014). We used JavaScript to detect the location of each word that was displayed on the screen.
After gathering the data from the experiment, we used an off-line process to detect the “collisions” between the eye-tracker gaze points and the words displayed to the user. To do so, we used both time information and coordinate information. The result was a series of words annotated with the specific time-spans in which they were looked at.
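A minimal sketch of such an off-line collision step is shown below. It is an illustration under stated assumptions, not the project's code: each displayed word is assumed to be an axis-aligned rectangle with a visibility interval, gaze points are assumed to be `(t_ms, x, y)` tuples sorted by time, and the names `WordBox` and `annotate_words` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A displayed word: its screen rectangle and the time span it was visible."""
    text: str
    left: float
    top: float
    width: float
    height: float
    shown_ms: int
    hidden_ms: int

    def contains(self, x, y, t_ms):
        """True if the gaze point falls inside the box while the word is shown."""
        in_time = self.shown_ms <= t_ms <= self.hidden_ms
        in_box = (self.left <= x <= self.left + self.width
                  and self.top <= y <= self.top + self.height)
        return in_time and in_box


def annotate_words(gaze_points, word_boxes):
    """Turn time-sorted gaze points into (word, first_ms, last_ms) spans."""
    spans = []
    current = None  # [WordBox, first_ms, last_ms] for the word being looked at
    for t_ms, x, y in gaze_points:
        hit = next((w for w in word_boxes if w.contains(x, y, t_ms)), None)
        if current is not None and hit is current[0]:
            current[2] = t_ms  # still on the same word: extend the span
            continue
        if current is not None:
            spans.append(current)  # gaze left the word: close the span
        current = [hit, t_ms, t_ms] if hit is not None else None
    if current is not None:
        spans.append(current)
    return [(word.text, first, last) for word, first, last in spans]
```

A span is closed as soon as the gaze leaves the word, so a word that is re-read later produces a second, separate span. The fixation indicator provided by the eye-tracker could be used to discard non-fixation points before calling `annotate_words`.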
FER
Facial Expression Recognition (FER) is the process of detecting an individual's emotion by assessing their facial expressions in an image or video. In the past, FER has been used for various purposes, including psychological studies, tiredness detection, facial animation and robotics.
Data Gathering
We used the Microsoft Kinect with the Kinect SDK 2.0 to capture the individual's face. The Kinect provided us with color and infrared images, as well as depth data; however, for this project we only worked with the color data. The data from the Kinect was saved as a sequence of color images, recorded at a rate of 30 frames per second (fps). The capture code used multithreading to sustain a high frame rate with low memory usage. Each image frame was assigned a timestamp in milliseconds, which was saved in a text file.
Feature Extraction
After extracting the data from the Kinect, we processed the images to locate the facial landmarks. We tested the images with Face++, which is a free API for face detection, recognition and analysis. Using Face++, we were able to locate 83 facial landmarks on the images. The data obtained from the API was the name of each landmark along with its x and y coordinates.
The next step involved obtaining Action Units (AUs) from the facial landmarks located through Face++. Action Units are the actions of individual muscles or groups of muscles, such as raising the outer eyebrow or stretching the lips (Cohn et al. 2001). To determine which AUs to use for FER, as well as how to calculate them, Tekalp and Ostermann (2000) was taken as a reference.
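To give a flavor of how a landmark-based measurement can serve as an AU-like feature, here is a small sketch. It does not reproduce the AU definitions taken from Tekalp and Ostermann (2000), which are not given here; it only shows one plausible proxy for eyebrow raising. The landmark names are hypothetical placeholders rather than the keys returned by Face++, and the landmarks are assumed to be a dictionary mapping a name to an `(x, y)` pair.

```python
import math


def brow_raise(landmarks):
    """Eyebrow-to-eye distance divided by the inter-ocular distance.

    Dividing by the distance between the eyes makes the value independent
    of how large the face appears in the image.
    """
    inter_ocular = math.dist(landmarks["left_eye_center"],
                             landmarks["right_eye_center"])
    if inter_ocular == 0:
        raise ValueError("eye centres coincide; cannot normalise")
    gap = math.dist(landmarks["left_eyebrow_center"],
                    landmarks["left_eye_center"])
    return gap / inter_ocular


def brow_raise_change(neutral, frame):
    """Change of the eyebrow feature relative to a neutral reference frame."""
    return brow_raise(frame) - brow_raise(neutral)
```

Measuring the change against a neutral frame of the same person, rather than the raw value, is a design choice that reduces the effect of individual differences in face shape.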
Classification
The final step of the process was classifying each image frame into one of eight emotions: happiness, sadness, fear, anger, disgust, surprise, neutral and confusion. We used the MATLAB Neural Network Toolbox (MathWorks, Inc., 2015) and designed a simple feed-forward neural network trained with backpropagation.
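For readers unfamiliar with this kind of classifier, the sketch below shows a small feed-forward network trained with backpropagation in plain Python. It is a stand-in for illustration, not the MATLAB network used in the project: the single tanh hidden layer, the softmax output, the layer sizes, the learning rate and the feature vector are all assumptions.

```python
import math
import random

EMOTIONS = ["happiness", "sadness", "fear", "anger",
            "disgust", "surprise", "neutral", "confused"]


class FeedForwardNet:
    """One tanh hidden layer and a softmax output, trained by backpropagation."""

    def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_outputs)]
        self.b2 = [0.0] * n_outputs

    def forward(self, x):
        """Return the hidden activations and the class probabilities."""
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return hidden, [e / total for e in exps]

    def train_step(self, x, target, lr=0.1):
        """One gradient step on one example; returns the loss before the update."""
        hidden, probs = self.forward(x)
        # Softmax with cross-entropy: output error is probability minus one-hot.
        d_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # Propagate the error back through the tanh hidden layer.
        d_hidden = [(1.0 - h * h)
                    * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
                    for j, h in enumerate(hidden)]
        for k, d in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * d * h
            self.b2[k] -= lr * d
        for j, d in enumerate(d_hidden):
            for i, v in enumerate(x):
                self.w1[j][i] -= lr * d * v
            self.b1[j] -= lr * d
        return -math.log(probs[target])

    def predict(self, x):
        """Return the name of the most probable emotion."""
        _, probs = self.forward(x)
        return EMOTIONS[max(range(len(probs)), key=probs.__getitem__)]
```

In use, each input `x` would be a vector of AU-style features for one frame and `target` the index of its labeled emotion in `EMOTIONS`; `train_step` would be called repeatedly over the labeled frames before `predict` is applied to new frames.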
Pilot Results
EEG
In our pilot classification study, we experimented with the Alcoholism data set used by Kuncheva and Rodriguez (2012), from the UC Irvine (UCI) Machine Learning Repository (2010), which contains raw ERP data for 10 alcoholic subjects and 10 sober subjects. We extracted the features using interval feature extraction. A total of 96K features were extracted from each subject's data. We achieved around 98% accuracy on the training data.
FER
We experimented on three different individuals as well as several images from the Cohn-Kanade facial expressions database. The application had roughly 75–80% accuracy, and this accuracy could be improved further by adding more data to the training set. The classifier was also more accurate for some emotions than for others. For example, images depicting happiness were classified more accurately than images depicting any other emotion, while the classifier had difficulty distinguishing between fear, anger and sadness.
Conclusion
In this paper, we presented Pokerface, a word-emotion detector that can detect the emotions of users as they read text. We built a video interface that displays subtitled videos and used the EyeTribe eye-tracker to identify which subtitle word a user is looking at at a given time. We used the Emotiv EPOC headset to obtain EEG brainwaves from the user and the Microsoft Kinect to obtain their facial expressions, and extracted features from both. We used Neural Networks to classify both the facial expressions and the EEG brainwaves into emotions.
Future directions of work include improving the accuracy of the FER and EEG emotion classification components. The EEG results can also be improved by exploring additional artifact detection and removal techniques. Furthermore, we want to integrate the whole pipeline into a seamless application that allows effortless experimentation.
Once the setup is streamlined, Pokerface can be used to explore many different applications that optimize users' experiences in education, news, advertising, etc. For example, the word-emotion detector can be used in computer-assisted learning to provide students with virtual affective support, such as detecting confusion and providing clarifications.