VOCALISE

VOCALISE (Voice Comparison and Analysis of the Likelihood of Speech Evidence) is a powerful and flexible voice biometrics software solution for performing fast and accurate speaker comparisons using Deep Neural Network (DNN) based x-vectors.
- Comparison of multi-speaker files without diarisation
- Three measures of objective audio quality and an overall ‘star rating’
- Sort files based on objective quality metrics
- Send subsets of files for analysis/comparison or write them to a new folder
Highlights

- Operates using the state-of-the-art x-vector PLDA (Probabilistic Linear Discriminant Analysis) framework
- Powerful visualisation feature for i-vector and x-vector approaches: view an interactive 3D point cloud of speakers
- Condition adaptation for ‘tuning’ the i-vector or x-vector PLDA system to a particular or unseen recording condition
- Score calibration to enable interpretation of scores across comparisons in different conditions
- Supplied with Bio-Metrics software, which supports performance assessment via equal error rates, Tippett plots, DET curves, and zoo plot analysis, and calculates likelihood ratios under same-speaker and different-speaker hypotheses
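As a rough illustration of one of the performance measures listed above, the equal error rate (EER) is the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below is not Bio-Metrics code: the function name `equal_error_rate` and the scores are made up for illustration, and the search simply scans candidate thresholds taken from the scores themselves.

```python
def equal_error_rate(same_scores, diff_scores):
    """Return (eer, threshold) at the point where the two error rates are closest."""
    thresholds = sorted(set(same_scores) | set(diff_scores))
    best = None
    for t in thresholds:
        # False rejection: a same-speaker comparison scoring below the threshold.
        frr = sum(s < t for s in same_scores) / len(same_scores)
        # False acceptance: a different-speaker comparison scoring at or above it.
        far = sum(s >= t for s in diff_scores) / len(diff_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical comparison scores (higher means "more similar").
same = [2.1, 1.7, 0.9, 3.0, 0.2]
diff = [-1.5, -0.3, 0.4, -2.2, 1.0]
eer, threshold = equal_error_rate(same, diff)
print(f"EER = {eer:.0%} at threshold {threshold}")
```

With these toy scores the two error rates meet at 20% at a threshold of 0.9. Real evaluations use far more comparisons and usually interpolate between thresholds.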
State-of-the-art algorithms
VOCALISE supports several generations of automatic speaker recognition technology, from classic approaches based on Gaussian Mixture Models (GMMs) to state-of-the-art i-vectors and x-vectors. Over the course of this evolution, system complexity has increased, as has recognition performance.
Deep Neural Networks (DNNs) mark a new phase in the evolution of automatic speaker recognition technology, providing a powerful framework for extracting highly discriminative speaker-specific features from a recording of speech. VOCALISE continues to support both legacy and current speaker modelling algorithms, the latest of which is the DNN-based ‘x-vector’ framework, which uses a DNN to extract compact speaker representations.
The following section gives a brief overview of the state-of-the-art speaker modelling approaches within VOCALISE, based on Kelly et al. (2019), which is available for download; see also our other publications.
An x-vector is a vector representation of a speaker’s voice extracted from an audio file. The x-vector is of fixed size (typically 512 elements), regardless of the duration of the recording. The extraction of an x-vector requires features from the audio file, along with a pre-trained Deep Neural Network (DNN).
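To see how a recording of any length can yield a vector of fixed size, consider the statistics pooling step used in x-vector architectures described in the literature: per-frame values are summarised by their mean and standard deviation over time. The sketch below illustrates only that idea. It is not the VOCALISE extractor, the function name `stats_pool` is hypothetical, and the three-dimensional frames are toy data (real systems pool much larger hidden-layer outputs).

```python
import statistics


def stats_pool(frames):
    """Concatenate the per-dimension mean and standard deviation over all frames."""
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means + stds


# Two toy "recordings" with different numbers of frames, three values per frame.
short_rec = [[float(i), float(2 * i), 1.0] for i in range(4)]
long_rec = [[float(i), float(2 * i), 1.0] for i in range(50)]

# Both pooled vectors have 2 * 3 = 6 elements, whatever the frame count.
print(len(stats_pool(short_rec)), len(stats_pool(long_rec)))
```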
Artificial neural networks are collections of connected units or nodes, each of which loosely models the behaviour of a neuron in the brain. Typically, each node in the network receives one or more inputs and outputs a weighted sum of these inputs, usually passed through a non-linear activation function. Training an artificial neural network involves adjusting the weight associated with each connection so that errors are minimised on a set of training data. Deep Neural Networks are artificial neural networks with multiple layers of connected nodes. This so-called ‘deep’ extension allows representations of data to be learned at multiple levels of abstraction, enabling complex relationships to be modelled. The VOCALISE x-vector extractor is a feed-forward DNN consisting of nine layers.
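The node behaviour described above can be sketched in a few lines. This is a generic toy network, not the VOCALISE extractor: the two-layer shape, the weights, and the choice of ReLU as the activation function are all assumptions made for illustration.

```python
def dense_layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per node, followed by ReLU."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU: negative sums become zero
    return outputs


def forward(inputs, layers):
    """Feed the inputs through each layer in turn (a feed-forward pass)."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs


# Hypothetical weights: layer 1 maps 2 inputs to 2 nodes, layer 2 maps 2 to 1.
layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -0.5]),
    ([[1.0, 2.0]], [0.1]),
]
out = forward([2.0, 1.0], layers)
print(out)
```

Training would adjust the numbers in `layers` to reduce errors on training data; that step is omitted here.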
An i-vector (where the ‘i’ refers to ‘identity’) is a vector representation of a speaker’s voice extracted from an audio file. The i-vector is of fixed size (typically 400 elements), regardless of the duration of the recording. The extraction of an i-vector requires features from the audio file, along with some pre-trained models.
‘Open-box’ architecture

VOCALISE has been developed with an ‘open-box’ architecture. The idea underlying the design is that the user should be able to change the system parameters and introduce new data at every step of the speaker recognition process. With this approach, the user is not limited to manufacturer-provided models or configurations, and has the ability to train the system specifically for their problem domain. In addition, VOCALISE interfaces with ‘trusted’ programs like Praat to allow the user to utilise features provided by such specialist software.
Flexible speaker feature extraction options
Choice of spectral and phonetic features
Spectral features
Spectral features are descriptors of the frequency characteristics of a speech sample, extracted automatically over short time windows, and are the most commonly used feature type in speech and speaker recognition. VOCALISE currently supports flexible Mel-frequency cepstral coefficient (MFCC) features with:
- Adjustable frequency band selection
- Optional energy, delta and delta-delta coefficients
- Cepstral mean subtraction (CMS) or cepstral mean and variance normalisation (CMVN)
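The normalisation and derivative options in the list above can be illustrated on a small matrix of frames. This is a minimal sketch, not VOCALISE code: the function names `cmvn` and `deltas` are hypothetical, the input stands in for already-computed MFCC frames, and the delta is a simple two-point difference, whereas real front ends often use a regression over a wider window.

```python
import statistics


def cmvn(frames, normalise_variance=True):
    """Subtract each coefficient's mean (CMS); optionally scale to unit variance (CMVN)."""
    columns = list(zip(*frames))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        [(x - m) / (s if normalise_variance else 1.0)
         for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


def deltas(frames):
    """Half the difference between the next and previous frame, with edges repeated."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[i], padded[i + 2])]
        for i in range(len(frames))
    ]


# Toy stand-in for MFCC frames: 3 frames, 2 coefficients each.
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
cms_only = cmvn(mfcc, normalise_variance=False)
normalised = cmvn(mfcc)
delta = deltas(mfcc)
delta_delta = deltas(delta)  # delta-delta is the delta of the deltas
```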
Auto-phonetic features
The use of ‘auto-phonetic’ features, i.e. phonetic features extracted in an automatic (unsupervised) way, is supported via an interface with Praat.
User-provided features
Users can supply their own features to the system in GMM and GMM-MAP modes.

Advanced features

- Playback and visualisation: Choose between Waveform and Spectrogram view.
- Drag-and-drop: Specify a folder containing the files to compare against, or drop the files into the interface for comparison.
- Optimised to run quickly on multi-processor machines: Allows for quick comparison of files using both spectral and phonetic features.
- Ready-to-use ‘sessions’: Pre-trained and optimised models provided. If desired, create your own custom session, from scratch or from an existing session.
- Mode switching: Easily switch from the Phonetic Analysis mode to the MFCC-based spectral mode.
- Voice Activity Detection (VAD): Pre-processing step that discards silences or low-energy speech content.
- Unique capability of using features traditionally used in forensic phonetic acoustic analysis: Includes pitch and formants, as well as spectral features such as Mel-frequency cepstral coefficients (MFCCs).
- Advanced configurability of MFCCs and derivatives: Configure delta and delta-delta coefficients and Cepstral Mean Subtraction (CMS) or Cepstral Mean and Variance Normalisation (CMVN).
- Apply automatic speaker recognition techniques to phonetic features: Modelling of long-term formant distributions and formant dynamics with Universal Background Models (UBMs) and GMMs, as well as with state-of-the-art i-vector and x-vector approaches.
- Result presentation: After running a comparison, the audio files are sorted by comparison score in decreasing order, with the most similar file appearing at the top of the list.
- Exporting output scores: Save, edit, and copy files, then export scores into the Bio-Metrics software for analysis.
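The list above describes Voice Activity Detection as a pre-processing step that discards silences and low-energy content, but does not say how VOCALISE implements it. A minimal energy-based sketch, under the assumptions of 8 kHz audio, 20 ms frames, and a hypothetical 30 dB margin below the loudest frame, might look like this:

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz


def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in decibels."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies


def energy_vad(samples, frame_len=160, margin_db=30.0):
    """Mark a frame as speech if its energy is within margin_db of the loudest frame."""
    energies = frame_energies_db(samples, frame_len)
    threshold = max(energies) - margin_db
    return [e >= threshold for e in energies]


# Synthetic signal: 40 ms of silence, 40 ms of a 200 Hz tone, 20 ms of silence.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(320)]
signal = [0.0] * 320 + tone + [0.0] * 160
decisions = energy_vad(signal)
print(decisions)
```

Only the two frames containing the tone are kept. Practical detectors typically add smoothing across frames and are more robust to background noise than a single energy threshold.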

Selected publications using VOCALISE
Publications by OWR and some of our users
- Tomáš Nechanský, Tomáš Bořil, Alžběta Růžičková, Radek Skarnitzl, and Vojtěch Skořepa. The effect of language and temporal mismatch on LTF and ASR analyses, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2021, Marburg, Germany.
- David van der Vloed, Finnian Kelly, and Anil Alexander. Exploring the effects of device variability on forensic speaker comparison using VOCALISE and NFI-FRIDA, a forensically realistic database, Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo, Japan [to appear].
- Radek Skarnitzl, Maral Asiaee, Mandana Nourbakhsh. Tuning the performance of automatic speaker recognition in different conditions, International Journal of Speech, Language and the Law, 26(2), pp. 209-229. https://doi.org/10.1558/ijsll.39778
- Finnian Kelly, Andrea Fröhlich, Volker Dellwo, Oscar Forth, Samuel Kent, and Anil Alexander. Evaluation of VOCALISE under conditions reflecting those of a real forensic voice comparison case (forensic_eval01), Speech Communication, vol. 112, pp. 30-36, September 2019.
- Sula Ross, Katherine Earnshaw, and Erica Gold. A Cautionary Tale For Phonetic Analysis: The Variability of Speech Between and Within Recording Sessions, 19th International Congress of the Phonetic Sciences, pp. 3090-3094, Australasian Speech Science and Technology Association Inc., August 2019.
- Francis Rumsey. Audio Forensics: Keeping up in the Age of Smartphones and Fakery, Journal of the Audio Engineering Society, 67(7/8), pp. 617-622, 2019.
- Finnian Kelly, Oscar Forth, Samuel Kent, Linda Gerlach, and Anil Alexander. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors, Audio Engineering Society (AES) Forensics Conference 2019, Porto, Portugal.
- Finnian Kelly, Anil Alexander, Oscar Forth, Samuel Kent, Jonas Lindh, and Joel Åkesson. Identifying Perceptually Similar Voices with a Speaker Recognition System Using Auto-Phonetic Features, INTERSPEECH 2016, pp. 1567-1568, San Francisco, USA.
- Anil Alexander, Oscar Forth, Alankar Atreya, and Finnian Kelly. VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features, Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, Spain.
- Julia Forsberg, Johan Gross, Jonas Lindh, Joel Åkesson. Speaker comparison evaluation using a new corpus of urban speech, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2015, pp. 46-47, Leiden, Netherlands.
- Michael Jessen, Anil Alexander, and Oscar Forth. Forensic Voice Comparisons in German with Phonetic and Automatic Features using VOCALISE software, Audio Engineering Society (AES) Forensics Conference, 2014, London, UK.
- Marianne Jessen, Oscar Forth, and Anil Alexander. VOCALISE: eine gemeinsame Plattform für die Anwendung automatischer und semiautomatischer Methoden in forensischen Stimmenvergleichen, German article in Polizei & Wissenschaft, April 2013.
- Tom Iszatt, Ekrem Malkoc, Finnian Kelly, and Anil Alexander. Exploring the impact of face coverings on x-vector speaker recognition using VOCALISE, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2021, Marburg, Germany.
- Linda Gerlach, Finnian Kelly, and Anil Alexander. More than just identity: speaker recognition and speaker profiling using the GBR-ENG database, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2019, Istanbul, Turkey.
- Linda Gerlach, Finnian Kelly, and Anil Alexander. One out of many: A sliding window approach to automatic speaker recognition with multi-speaker files, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2019, Istanbul, Turkey.
- David van der Vloed, Anil Alexander, and Finnian Kelly. The effect of background data selection on the strength of evidence, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2019, Istanbul, Turkey.
- Finnian Kelly, Anil Alexander, Oscar Forth, and David van der Vloed. From i-vectors to x-vectors – a generational change in speaker recognition illustrated on the NFI-FRIDA database, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2019, Istanbul, Turkey.
- Finnian Kelly, Anil Alexander, Oscar Forth, and David van der Vloed. Speaker recognition system adaptation to unseen and mismatched recording devices in the NFI-FRIDA database, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2018, Huddersfield, UK.
- David van der Vloed, Jos Bouten, Finnian Kelly, and Anil Alexander. Forensically Realistic Inter-Device Audio (NFI-FRIDA) and initial experiments, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2018, Huddersfield, UK.
- Alankar Atreya, Oscar Forth, Samuel Kent, Finnian Kelly, and Anil Alexander. Estimating the Good, the Bad and the Ugly in Speech Recordings, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2018, Huddersfield, UK.
- Finnian Kelly, Oscar Forth, Alankar Atreya, Samuel Kent, and Anil Alexander. Automatic Speaker Profiling using i-vectors, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2017, Split, Croatia.
- Anil Alexander, Oscar Forth, Alankar Atreya, Samuel Kent, and Finnian Kelly. Automatically recognizing speakers in multi-speaker recordings, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2017, Split, Croatia.
- Finnian Kelly, Anil Alexander, Oscar Forth, Samuel Kent, Jonas Lindh, and Joel Åkesson. Automatically identifying perceptually similar voices for voice parades, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2016, York, UK.
- Anil Alexander, Oscar Forth, Alankar Atreya, and Finnian Kelly. iVOCALISE: An i-vector-based automatic speaker recognition system using spectral and phonetic features, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2016, York, UK.
- Anil Alexander, Oscar Forth, John Nash, and Neil Yager. Zooplots for Speaker Recognition with Tall and Fat Animals, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2014, Zürich, Switzerland.
- Anil Alexander, Oscar Forth, Marianne Jessen, and Michael Jessen. Speaker recognition with Phonetic and Automatic Features using VOCALISE software, International Association for Forensic Phonetics and Acoustics (IAFPA) conference 2013, Tampa, Florida, USA.