Perceptual Analysis of Speaker Embeddings for Voice Discrimination Between Machine and Human Listening
Abstract:
This study investigates the information captured by speaker embeddings that is relevant to human speech perception. A Convolutional Neural Network was trained to perform one-shot speaker verification under clean and noisy conditions, such that high-level abstractions of speaker-specific features were encoded in a latent embedding vector. We demonstrate that robust and discriminative speaker embeddings can be obtained by using a training loss function that optimizes the embeddings for similarity scoring during inference. Computational analysis showed that such speaker embeddings predicted various hand-crafted acoustic features, while no single feature explained a substantial portion of the embeddings' variance. Moreover, the relative distances in the speaker embedding space moderately coincided with voice similarity as judged by human listeners. These findings confirm the overlap between machine and human listening when discriminating voices and motivate further research on the remaining disparities to improve model performance.
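The one-shot verification setup described above scores a trial by comparing the embedding of a single enrollment utterance against the embedding of a test utterance. As a minimal sketch (not the paper's actual pipeline), the decision rule can be illustrated with cosine similarity between embedding vectors and a hypothetical decision threshold; the vectors and threshold below are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(emb_enroll, emb_test, threshold=0.7):
    # Accept the trial (same speaker) if similarity exceeds the
    # threshold; 0.7 is an arbitrary illustrative value, not one
    # taken from the paper.
    return cosine_similarity(emb_enroll, emb_test) >= threshold

# Toy embeddings standing in for CNN outputs.
same = verify([0.9, 0.1, 0.2], [0.8, 0.15, 0.25])   # similar direction
diff = verify([0.9, 0.1, 0.2], [-0.5, 0.9, 0.1])    # dissimilar direction
print(same, diff)
```

In practice, the threshold would be calibrated on a held-out set of same-speaker and different-speaker trials, which is consistent with the abstract's point that the loss should shape the embedding space for this kind of similarity scoring.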