Oxford Wave Research publications at ODYSSEY 2020

Two of our publications at the ODYSSEY 2020 Speaker and Language Recognition Workshop

Two of our collaborative papers, one on voice spoofing detection, and the other on the effects of device variability on forensic speaker comparison, are appearing at this week’s virtual ODYSSEY 2020 Speaker and Language Recognition Workshop. Video presentations for both papers are now available on the workshop website: http://www.odyssey2020.org/

The full papers, along with the rest of the conference proceedings, can be found at: https://www.isca-speech.org/archive/odyssey_2020/index.html


In our paper with Bence Halpern (PhD student, University of Amsterdam), “Residual networks for resisting noise: analysis of an embeddings-based spoofing countermeasure,” we propose a new embeddings-based method of spoofed speech detection using Constant-Q Transform (CQT) features and a Dilated ResNet Deep Neural Network (DNN) architecture. The novel CQT-GMM-DNN approach, which feeds the DNN embeddings to a Gaussian Mixture Model (GMM) classifier, performs favourably compared to the baseline system in both clean and noisy conditions. We also present some ‘explainable audio’ results, which provide insight into the information the DNN exploits for decision-making. This study shows that reliable detection of spoofed speech is increasingly possible, even in the presence of noise.
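For readers unfamiliar with GMM back-ends, the sketch below illustrates the general idea of scoring embeddings with one GMM per class and taking a log-likelihood ratio. It is not the implementation from the paper: the random vectors stand in for DNN embeddings, and the embedding dimension, number of mixture components, iteration count and function names (`fit_gmm`, `gmm_loglik`) are all assumptions chosen for illustration. The diagonal-covariance GMM is a small hand-rolled EM in NumPy, kept minimal rather than robust.

```python
import numpy as np


def log_gauss_diag(x, means, variances):
    """Log density of each row of x under each diagonal Gaussian; shape (n, k)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))


def fit_gmm(x, k=2, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with plain EM (illustrative, not robust)."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    means = x[rng.choice(n, size=k, replace=False)]
    variances = np.tile(x.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        log_p = log_gauss_diag(x, means, variances) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        second_moment = (resp.T @ x ** 2) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, 1e-6)
    return weights, means, variances


def gmm_loglik(x, gmm):
    """Per-sample log-likelihood of x under a fitted GMM."""
    weights, means, variances = gmm
    log_p = log_gauss_diag(x, means, variances) + np.log(weights)
    return np.logaddexp.reduce(log_p, axis=1)


# Synthetic stand-ins for DNN embeddings (assumption: 8-dimensional vectors).
rng = np.random.default_rng(42)
dim = 8
bona_train = rng.normal(loc=1.0, size=(500, dim))
spoof_train = rng.normal(loc=-1.0, size=(500, dim))
bona_test = rng.normal(loc=1.0, size=(100, dim))
spoof_test = rng.normal(loc=-1.0, size=(100, dim))

# One GMM per class, fitted on that class's training embeddings.
gmm_bona = fit_gmm(bona_train)
gmm_spoof = fit_gmm(spoof_train)

# Score = log-likelihood ratio; higher means "more likely bona fide".
bona_scores = gmm_loglik(bona_test, gmm_bona) - gmm_loglik(bona_test, gmm_spoof)
spoof_scores = gmm_loglik(spoof_test, gmm_bona) - gmm_loglik(spoof_test, gmm_spoof)

print("mean score, bona fide trials:", bona_scores.mean())
print("mean score, spoofed trials:  ", spoof_scores.mean())
```

In a real system the embeddings would come from the trained network rather than a random generator, and a decision threshold on the score would be set on held-out data.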

See a blog post from Bence (including some explainable audio examples) here: https://karkirowle.github.io/publication/odyssey-2020

FRIDA2

In our paper with David van der Vloed (from the Netherlands Forensic Institute), “Exploring the effects of device variability on forensic speaker comparison using VOCALISE and NFI-FRIDA, a forensically realistic database,” we investigate the effect of recording device mismatch on forensic speaker comparison with VOCALISE. Using the forensically realistic NFI-FRIDA database, which consists of speech recorded simultaneously on multiple devices (e.g. close-mic, far-mic, and telephone intercept, as seen in the data collection image), we demonstrate that while optimal performance is achieved by matching the relevant population recording device to the case data recording device, it is not necessary to match the precise device; broadly matching the device type is sufficient. This study presents a methodology by which a forensic practitioner can corroborate their subjective judgment of the ‘representativeness’ of the relevant population in forensic speaker comparison casework.