In this paper, we explore the effect of using different convolutional layers, batch normalization and the global average pooling layer upon a convolutional neural network (CNN) based gaze tracking system. A novel method is proposed to label the participant's face images as gaze points retrieved from eye tracker while watching videos for building a training dataset that is closer to human visual behavior. The participants can swing their head freely; therefore, the most real and natural images can be obtained without too many restrictions. The labeled data are classified according to the coordinate of gaze and area of interest on the screen. Therefore, varied network architectures are applied to estimate and compare the effects including the number of convolutional layers, batch normalization (BN) and the global average pooling (GAP) layer instead of the fully connected layer. Three schemes, including the single eye image, double eyes image and facial image, with data augmentation are used to feed into neural network to train and evaluate the efficiency. The input image of the eye or face for an eye tracking system is mostly a small-sized image with relatively few features. The results show that BN and GAP are helpful in overcoming the problem to train models and in reducing the amount of network parameters. It is shown that the accuracy is significantly improved when using GAP and BN at the mean time. Overall, the face scheme has a highest accuracy of 0.883 when BN and GAP are used at the mean time. Additionally, comparing to the fully connected layer set to 512 cases, the number of parameters is reduced by less than 50% and the accuracy is improved by about 2%. A detection accuracy comparison of our model with the existing George and Routray methods shows that our proposed method achieves better prediction accuracy of more than 6%.
- Batch normalization
- Convolution neural network
- Gaze tracking
- Global average spooling layer