Center for Research in Computer Vision

Deep Learning Human Mind for Automated Visual Classification

"Learning never exhausts the mind"

The Idea

What if we could effectively read the mind and transfer human visual capabilities to computer vision methods? In this work, we address this question by developing the first visual object classifier driven by human brain signals. In particular, we employ EEG data evoked by visual object stimuli, combined with Recurrent Neural Networks (RNN), to learn a discriminative brain activity manifold of visual categories, in an effort to read the mind. Afterward, we transfer the learned capabilities to machines by training a Convolutional Neural Network (CNN)-based regressor to project images onto the learned manifold, thus allowing machines to employ human brain-based features for automated visual classification. We use a 128-channel EEG with active electrodes to record the brain activity of several subjects while they look at images of 40 ImageNet object classes. The proposed RNN-based approach for discriminating object classes using brain signals reaches an average accuracy of about 83%, which greatly outperforms existing methods attempting to learn EEG visual object representations. As for automated object categorization, our human brain-driven approach obtains competitive performance, comparable to that achieved by powerful CNN models, and it is also able to generalize over different visual datasets. This gives us real hope that the human mind can indeed be read and transferred to machines.

The Architecture

LSTM-based Encoder

The multi-channel temporal EEG signals are provided as input to the encoder module, which processes the whole time sequence and outputs an EEG feature vector as a compact representation of the input. Ideally, if an input sequence consists of the EEG signals recorded while a subject looks at an image, the resulting output vector should encode the brain activity information relevant for discriminating between image classes. The encoder network is trained by adding a classification module at its output (a softmax layer in all our experiments) and using gradient descent to learn the whole model’s parameters end-to-end. In our experiments, we tested several configurations of the encoder network:
  1. Common LSTM: the encoder network is made up of a stack of LSTM layers. At each time step t, the first layer takes the input s(·, t) (in this sense, “common” means that all EEG channels are initially fed into the same LSTM layer); if other LSTM layers are present, the output of the first layer (which may have a different size than the original input) is provided as input to the second layer and so on. The output of the deepest LSTM layer at the last time step is used as the EEG feature representation for the whole input sequence.
  2. Channel LSTM and Common LSTM: the first encoding layer consists of several LSTMs, each connected to only one input channel. In this way, the output of each “channel LSTM” is a summary of a single channel’s data. The second encoding layer then performs inter-channel analysis, by receiving as input the concatenated output vectors of all channel LSTMs. As above, the output of the deepest LSTM at the last time step is used as the encoder’s output vector.
  3. Common LSTM and Output layer: similar to the common LSTM architecture, but an additional output layer (a linear combination of the inputs, followed by a ReLU nonlinearity) is added after the LSTM in order to increase model capacity at little computational expense (compared to the two-layer common LSTM architecture). In this case, the encoded feature vector is the output of the final layer.
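
The three configurations above can be sketched as plain forward passes. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the sizes are toy values, the gate ordering (i, f, g, o) is a common convention chosen here, and the per-time-step concatenation in the second configuration is our reading of the description. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(rng, input_size, hidden_size):
    """Random (untrained) parameters for one LSTM layer; gates stacked as i, f, g, o."""
    scale = 1.0 / np.sqrt(hidden_size)
    return {
        "W": rng.uniform(-scale, scale, (4 * hidden_size, input_size)),
        "U": rng.uniform(-scale, scale, (4 * hidden_size, hidden_size)),
        "b": np.zeros(4 * hidden_size),
    }

def lstm_forward(x, p):
    """Run one LSTM layer over x of shape (time, input_size); return all hidden states."""
    hidden_size = p["U"].shape[1]
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    states = []
    for x_t in x:
        i, f, g, o = np.split(p["W"] @ x_t + p["U"] @ h + p["b"], 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
        h = sigmoid(o) * np.tanh(c)                   # emit the hidden state
        states.append(h)
    return np.stack(states)

def common_lstm(x, layers):
    """Configuration 1: stacked LSTMs fed with all channels at once."""
    seq = x
    for p in layers:
        seq = lstm_forward(seq, p)
    return seq[-1]  # deepest layer, last time step

def channel_then_common_lstm(x, channel_layers, common_layer):
    """Configuration 2: one LSTM per channel, then a common LSTM.

    Assumption: channel outputs are concatenated at every time step.
    """
    per_channel = [lstm_forward(x[:, [ch]], p) for ch, p in enumerate(channel_layers)]
    return lstm_forward(np.concatenate(per_channel, axis=1), common_layer)[-1]

def common_lstm_with_output(x, layers, W_out, b_out):
    """Configuration 3: common LSTM followed by a linear layer and ReLU."""
    return np.maximum(0.0, W_out @ common_lstm(x, layers) + b_out)

rng = np.random.default_rng(0)
T, C, H = 20, 8, 16                 # hypothetical toy sizes, not the paper's
eeg = rng.standard_normal((T, C))   # stand-in for s(., t): one row per time step

stack = [init_lstm(rng, C, H), init_lstm(rng, H, H)]
f1 = common_lstm(eeg, stack)

per_channel = [init_lstm(rng, 1, 4) for _ in range(C)]
f2 = channel_then_common_lstm(eeg, per_channel, init_lstm(rng, 4 * C, H))

W_out, b_out = 0.1 * rng.standard_normal((H, H)), np.zeros(H)
f3 = common_lstm_with_output(eeg, stack[:1], W_out, b_out)

print(f1.shape, f2.shape, f3.shape)  # (16,) (16,) (16,)
```

In each case the encoder reduces a whole EEG sequence to a single fixed-length vector, which is what the classification module then consumes.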

Encoder and classifier training is performed through gradient descent by providing the class label associated with the image shown while each EEG sequence was recorded. After training, the encoder can be used to generate EEG features from an input EEG sequence, while the classification network can be used to predict the image class for an input EEG feature representation, which can be computed from either EEG signals or images, as described in the next section.
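
As a rough illustration of the classification module, the sketch below fits a softmax layer by full-batch gradient descent on a cross-entropy loss. It makes several assumptions that are not in the text: the feature vectors are synthetic stand-ins for encoder outputs, the feature size, learning rate and step count are arbitrary, and only the softmax layer is updated, whereas the model described above also propagates gradients into the encoder. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def train_softmax(features, labels, num_classes, lr=0.05, steps=300):
    """Fit a softmax layer by gradient descent; return weights, bias and loss history."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        probs = softmax(features @ W + b)
        losses.append(cross_entropy(probs, labels))
        grad_logits = (probs - onehot) / n   # gradient of the mean loss w.r.t. the logits
        W -= lr * features.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return W, b, losses

rng = np.random.default_rng(0)
NUM_CLASSES, FEATURE_DIM, N = 40, 16, 400    # 40 classes as in the text; the rest is hypothetical
class_means = rng.standard_normal((NUM_CLASSES, FEATURE_DIM))
labels = rng.integers(0, NUM_CLASSES, N)
# Synthetic "EEG features": one noisy cluster per class (an assumption for illustration).
features = class_means[labels] + 0.3 * rng.standard_normal((N, FEATURE_DIM))

W, b, losses = train_softmax(features, labels, NUM_CLASSES)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With zero-initialised weights the first loss value equals ln 40 (about 3.69), the loss of a uniform guess over 40 classes, and it decreases from there as the layer is fitted.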

Regressing Images to EEG features

We employed two CNN-based approaches to extract EEG features (or, at least, a close approximation) from an input image.

The EEG Dataset

Six subjects (five male and one female) were shown visual stimuli of objects while EEG data was recorded. All subjects were homogeneous in terms of age, education level and cultural background. The dataset used for visual stimuli was a subset of ImageNet [18], containing 40 classes of easily recognizable objects. During the experiment, 2,000 images (50 from each class) were shown in bursts, for 0.5 seconds each. Each burst lasted 25 seconds and was followed by a 10-second pause during which a black image was shown, for a total running time of 1,400 seconds (23 minutes and 20 seconds). The experiments were conducted using a 128-channel cap with active, low-impedance electrodes (actiCAP128Ch). Brainvision DAQs and amplifiers were used for the EEG data acquisition. Sampling frequency and data resolution were set, respectively, to 1000 Hz and 16 bits.
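
The protocol figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above. It assumes that every burst, including the last one, is followed by a pause, and that no samples are discarded when an image's segment is cut out; the variable names are ours.

```python
# Figures taken from the description of the recording protocol.
NUM_CLASSES = 40
IMAGES_PER_CLASS = 50
SECONDS_PER_IMAGE = 0.5
BURST_SECONDS = 25
PAUSE_SECONDS = 10
SAMPLING_RATE_HZ = 1000
NUM_CHANNELS = 128

total_images = NUM_CLASSES * IMAGES_PER_CLASS               # 2000 images
images_per_burst = int(BURST_SECONDS / SECONDS_PER_IMAGE)   # 50 images per burst
num_bursts = total_images // images_per_burst               # 40 bursts

# Assumption: each burst is followed by one pause, including the last burst.
total_seconds = num_bursts * (BURST_SECONDS + PAUSE_SECONDS)
minutes, seconds = divmod(total_seconds, 60)

# Nominal size of the EEG segment recorded for one image.
samples_per_image = int(SAMPLING_RATE_HZ * SECONDS_PER_IMAGE)

print(f"{total_images} images in {num_bursts} bursts")          # 2000 images in 40 bursts
print(f"{total_seconds} s = {minutes} min {seconds} s")          # 1400 s = 23 min 20 s
print(f"segment shape: ({samples_per_image}, {NUM_CHANNELS})")  # segment shape: (500, 128)
```

The total matches the stated running time of 1,400 seconds, and each image nominally corresponds to 500 time samples across 128 channels.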

