NaijaFaceVoice: A Large-Scale Deep Learning Model and Database of Nigerian Faces and Voices
Publisher
IEEE
Abstract
Description
Fusing two or more traits in multimodal biometrics generally improves recognition accuracy, but by how much? Deep learning models generalize better and reach higher accuracy when trained on large-scale databases, so a large-scale multimodal database is beneficial. However, publicly available large-scale multimodal databases are scarce, especially for faces and voices. Moreover, because a face image is 2-D while a voice signal is 1-D, it is unclear how best to fuse the two. As a result, fusion has hitherto yielded only marginal improvements. This study proposes a semi-automated curation algorithm that extracts the faces and voices of target individuals from videos to create a large-scale face-voice database. The curation technique involves observing the time positions in each video at which the target subject's face and voice occur. These positions are supplied to a MATLAB R2017b script that detects the faces in the observed regions, then crops, resizes, auto-labels, and writes them to disk. A second MATLAB R2017b script extracts the audio content within the observed regions, auto-labels the voice segments, and writes them to disk. The resulting database, named NaijaFaceVoice, consists of 2,656 subjects with over 2 million faces and 195 hours of utterances. The database was used to develop a large-scale recognition system based on Convolutional Neural Networks. Robust fusion methods incorporating the proposed Spectrogram-Voting concept significantly improved performance, achieving a record equal error rate of 0.0003519%, an improvement by a factor of over 450.
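The voice-extraction step described in the abstract (cutting the audio inside the observed regions and auto-labelling each segment) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB script: the function name `slice_segments`, the `(start_s, end_s)` representation of an observed region, and the `<subject>_voice_<index>` label pattern are all assumptions made for illustration, and the audio track is modelled as a plain list of samples rather than a decoded file.

```python
from typing import List, Sequence, Tuple


def slice_segments(
    samples: Sequence[float],
    sample_rate: int,
    segments: List[Tuple[float, float]],
    subject_id: str,
) -> List[Tuple[str, Sequence[float]]]:
    """Cut observed (start_s, end_s) regions out of a 1-D audio track.

    Returns (label, samples) pairs. The label pattern is hypothetical.
    """
    clips = []
    for index, (start_s, end_s) in enumerate(segments, start=1):
        if end_s <= start_s:
            raise ValueError(f"segment {index} has non-positive duration")
        # Convert times in seconds to sample indices, clamped to the track.
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(samples), int(round(end_s * sample_rate)))
        # Auto-label: subject identifier plus a running segment counter.
        label = f"{subject_id}_voice_{index:04d}"
        clips.append((label, samples[first:last]))
    return clips


# Toy usage: a 10-sample track at 2 Hz, with the subject observed
# speaking at 1.0-2.5 s and 3.0-4.0 s (values invented for the example).
track = list(range(10))
clips = slice_segments(track, 2, [(1.0, 2.5), (3.0, 4.0)], "subj0001")
```

In a real pipeline each returned clip would be written to disk under its label; that step is omitted here so the sketch stays independent of any particular audio library.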
Keywords
TK Electrical engineering. Electronics. Nuclear engineering