Title: Deep multi-biometric fusion for audio-visual user re-identification and verification
Authors: Marras, Mirko
Marín-Reyes, Pedro A. 
Lorenzo-Navarro, Javier 
Castrillón-Santana, Modesto 
Fenu, Gianni
UNESCO Classification: 120304 Artificial intelligence
Keywords: Audio-visual learning
Cross-modal biometrics
Deep biometric fusion
Multi-biometric system
Re-identification, et al
Issue Date: 2020
Publisher: Springer 
Project: Identificación Automática de Oradores en Sesiones Parlamentarias Usando Características Audiovisuales. 
Journal: Lecture Notes in Computer Science 
Conference: 8th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2019 
Abstract: From border controls to personal devices, from online exam proctoring to human-robot interaction, biometric technologies are empowering individuals and organizations with convenient and secure authentication and identification services. However, most biometric systems leverage only a single modality and may face challenges related to acquisition distance, environmental conditions, data quality, and computational resources. Combining evidence from multiple sources at a certain level of the recognition pipeline (e.g., sensor, feature, score, or decision) may mitigate some limitations of common uni-biometric systems. Such fusion has rarely been investigated at the intermediate level, i.e., when uni-biometric model parameters are jointly optimized during training. In this chapter, we propose a multi-biometric model training strategy that digests face and voice traits in parallel, and we explore how it helps to improve recognition performance in re-identification and verification scenarios. To this end, we design a neural architecture for jointly embedding face and voice data, and we experiment with several training losses and audio-visual datasets. The idea is to exploit the relation between voice characteristics and facial morphology, so that face and voice uni-biometric models help each other to recognize people when trained jointly. Extensive experiments on four real-world datasets show that the biometric feature representation of a jointly trained uni-biometric model performs better than the one computed by the same uni-biometric model trained alone. Moreover, the recognition results are further improved by embedding face and voice data into a single shared representation of the two modalities. The proposed fusion strategy generalizes well to unseen and unheard users and should be considered a feasible solution for improving model performance. We expect that this chapter will help the biometric community shape research on deep audio-visual fusion in real-world contexts.
ISBN: 978-3-030-40013-2
ISSN: 0302-9743
DOI: 10.1007/978-3-030-40014-9_7
Source: Pattern Recognition Applications and Methods. ICPRAM 2019. Lecture Notes in Computer Science, v. 11996, p. 136-157
Appears in Collections: Book chapter

Items in accedaCRIS are protected by copyright, with all rights reserved, unless otherwise indicated.