VOCALISE: Automatic Speaker Recognition Software
‘VOCALISE’ – (Voice Comparison and Analysis of the Likelihood of Speech Evidence) is a forensic automatic speaker recognition system, built for the Windows platform, that allows users to perform comparisons using both ‘traditional’ forensic phonetic parameters and ‘automatic’ spectral features in a semi- or fully automatic way.
- Highly accurate speaker recognition with carefully pre-trained background models
- Capable of comparisons based on both spectral and phonetic features
- Fast one-to-many or many-to-many comparisons using multi-threading
- Inbuilt Voice Activity Detection (VAD) to ensure that only speech is compared
- Easy drag-and-drop based comparisons
- Results presented as a list sorted by similarity, with the most similar audio files first
- Click to listen to the original audio and the test audio files and compare them side-by-side
- Powerful 3D visualization of the results
- Easily switch between state-of-the-art and legacy speaker modelling algorithms
- Fully compatible with BioMetrics, to calculate performance metrics and likelihood ratios
One of the major criticisms of the application of automatic speaker recognition is that the underlying algorithms form a black box that the forensic examiner can neither look inside nor adapt to their own requirements. VOCALISE has been developed with an ‘open-box’ architecture. The idea underlying the design is that the user should be able to change the system parameters and introduce new data at every step of the speaker recognition process. With this approach, the user is not limited to manufacturer-provided models or configurations, and has the ability to train the system specifically for their problem domain. In addition, VOCALISE interfaces with ‘trusted’ programs like Praat to allow the user to utilise features provided by such specialist software.
State of the Art Algorithms
The latest version of VOCALISE (called iVOCALISE) operates using the state-of-the-art i-vector PLDA (Probabilistic Linear Discriminant Analysis) framework, which has emerged as the dominant approach in high-performance speaker recognition.
This approach offers performance improvements over its predecessors, particularly where there is a significant acoustic mismatch between the samples under comparison, such as different recording channels or different languages.
High Accuracy Pre-trained models
VOCALISE is supplied with a default configuration consisting of pre-trained models that have been tested and optimised. Should the user wish to customise the configuration, there is flexibility at all stages of the speaker recognition process for both the state-of-the-art and the legacy algorithms.
Adaptation to user-conditions
If you need to use iVOCALISE in a completely new recording condition (say, a new kind of microphone and transmission system that is on the market, or new voice-over-IP software) and you want to tune the performance of the system to this condition, you can simply provide some files recorded in this condition so that the system can adapt its mathematical models to it.
Speakers can be visualised in the i-vector space.
You can click on a speaker to find and play the closest speakers in this space.
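As a rough illustration of what "closest speakers in this space" could mean, the sketch below ranks speakers by cosine similarity between their vectors. This is an assumption made for illustration only: the text above does not state which distance measure VOCALISE uses, and the speaker names, the toy 3-dimensional vectors and the functions `cosine_similarity` and `closest_speakers` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def closest_speakers(query, speakers, k=3):
    """Return up to k (name, similarity) pairs, most similar to `query` first."""
    scored = [
        (name, cosine_similarity(speakers[query], vec))
        for name, vec in speakers.items()
        if name != query  # do not compare the query speaker with itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration; real i-vectors typically have hundreds of dimensions.
speakers = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.9, 0.1, 0.0],
    "spk_c": [0.0, 1.0, 0.0],
    "spk_d": [-1.0, 0.0, 0.0],
}

for name, score in closest_speakers("spk_a", speakers, k=2):
    print(f"{name}: {score:.3f}")
```

Cosine similarity is only one plausible choice here; a PLDA-based score or Euclidean distance would give a different ranking in general.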
Flexible speaker feature extraction options
Choice of Spectral and Phonetic Features
Spectral features are descriptors of the frequency characteristics which are automatically extracted from a speech sample over short time windows, and are the most commonly used feature-type in speech and speaker recognition. VOCALISE currently supports flexible MFCC features with:
- Adjustable frequency band selection
- Optional energy, delta and delta-delta coefficients
- Cepstral mean subtraction (CMS) or cepstral mean and variance normalisation (CMVN)
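The post-processing options in the list above can be pictured with a small sketch. It assumes the cepstral coefficients have already been extracted as a list of frames, and it uses a simple two-point difference for the delta coefficients; neither the function names `cmvn` and `deltas` nor these formulas are taken from VOCALISE, whose exact computations are not described here.

```python
import math


def cmvn(frames, normalise_variance=True):
    """Subtract the per-coefficient mean (CMS); optionally divide by the
    per-coefficient standard deviation as well (CMVN)."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    if normalise_variance:
        # A constant coefficient has zero deviation; fall back to 1.0 to avoid dividing by zero.
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dims)
        ]
    else:
        stds = [1.0] * dims
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]


def deltas(frames):
    """Approximate the time derivative as (next - previous) / 2,
    repeating the first and last frame at the edges."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


# Three toy frames with two coefficients each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
cms_frames = cmvn(frames, normalise_variance=False)  # mean subtraction only
cmvn_frames = cmvn(frames)                           # mean and variance normalisation
delta = deltas(frames)                               # delta coefficients
delta_delta = deltas(delta)                          # delta-delta coefficients
```

Delta-delta coefficients are obtained here by applying the same difference to the deltas; production systems often use a regression over a wider window instead.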
Auto-phonetic Features
The use of ‘auto-phonetic’ features, i.e. phonetic features extracted in an automatic (unsupervised) way, is supported via an interface with Praat. Currently, any combination of formants F1 to F4 can be selected via the user interface. Other auto-phonetic features, such as pitch, can be included by modifying an external Praat script.
User Provided Features
‘User-provided’ refers to the option that allows users to provide their own features to the system. These may be features that have been manually measured and labelled.
- Playback and visualisation in waveform or spectrogram view.
- Optimized to run quickly on multi-processor machines, allowing quick comparison of files using both spectral and phonetic features
- Provides ready-to-use ‘sessions’, consisting of pre-trained and optimised models. If desired, users can create their own custom session, from scratch or from an existing session, by introducing data and setting modelling parameters for their particular use-case.
- Users can over-ride limitations on the durations of the recordings
- Voice activity detection (VAD) pre-processing step that discards silences or low-energy speech content.
- Unique capability of using either features traditionally used in forensic phonetic acoustic analysis, such as pitch and formants, or spectral features such as Mel Frequency Cepstral Coefficients (MFCCs)
- Advanced configurability of MFCCs and Derivatives, i.e. delta and delta-delta coefficients and Cepstral Mean Subtraction (CMS) or Cepstral Mean and Variance Normalisation (CMVN)
- Applies techniques similar to those used in automatic speaker recognition to phonetic features. These include the use of a universal background model, consideration of formant dynamics and long term formant distributions, as well as GMM-modeling, feature scaling and normalisations.
- Drag-and-drop: The user can specify a directory or folder which contains the files to compare against, or drop the files into the interface for comparison
- Result Presentation: The results will be presented in a list sorted by similarity, with the most similar audio files first.
- Result Playback: Clicking on any of the items in the sorted list will bring up the audio and the user will be able to listen and compare the original audio and the test audio side-by-side.
- Mode Switching: The user will be able to switch from Phonetic Analysis mode to MFCC-based spectral mode easily.
- Exporting output scores: The software will contain file saving, editing and copying capabilities, as well as the possibility of cutting and pasting the scores obtained. These scores can be exported into the BioMetrics Software.
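The voice activity detection step in the list above, which discards silences and low-energy content, can be pictured as a simple energy threshold. The sketch below is a generic energy-based detector written under that assumption, not the algorithm VOCALISE uses; the frame length, the threshold of 30 dB below the loudest frame, and the function names are all hypothetical.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in decibels."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        # The small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def speech_frames(samples, frame_len=4, threshold_db=-30.0):
    """Indices of frames whose energy lies within `threshold_db`
    of the loudest frame; all other frames are discarded."""
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    peak = max(energies)
    return [i for i, e in enumerate(energies) if e >= peak + threshold_db]


# Toy signal: silence, loud segment, very quiet segment, loud segment.
signal = (
    [0.0] * 4
    + [0.5, -0.5, 0.5, -0.5]
    + [0.001] * 4
    + [0.4, -0.4, 0.4, -0.4]
)
kept = speech_frames(signal)
print(kept)
```

A relative threshold is used so that the decision does not depend on the absolute recording level; real detectors usually add smoothing over neighbouring frames.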
Selected Publications using VOCALISE
- 2016: ‘VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features’, Anil Alexander, Oscar Forth, Alankar Aryal Atreya and Finnian Kelly, [to appear] Speaker Odyssey 2016, Bilbao, Spain
- 2014: ‘Forensic Voice Comparisons in German with Phonetic and Automatic Features using VOCALISE software’, Michael Jessen, Anil Alexander, and Oscar Forth, Audio Engineering Society (AES) Forensics Conference, 2014, London, UK
- 2013: ‘VOCALISE: eine gemeinsame Plattform für die Anwendung automatischer und semiautomatischer Methoden in forensischen Stimmenvergleichen’, Michael Jessen, Oscar Forth and Anil Alexander, Polizei & Wissenschaft, 04/2013