CLEAVER – Cluster Estimation And Versatile Extraction of Regions
‘CLEAVER’ (Cluster Estimation And Versatile Extraction of Regions) software provides an easy means of separating out the speech of individual speakers from a recording that contains more than one speaker. It allows the user either to extract only the speech of the person of interest or to highlight the parts of the audio where that person is speaking.
What can you do with CLEAVER?
Using CLEAVER, you can quickly separate out the speakers in a multi-speaker conversation and extract the speech belonging to each speaker. Here is an example of such a speaker separation, taken from an interview between two speakers in the DyViS corpus (Cambridge). The first file contains a section of the original interview with both speakers, and the second and third files contain the speech of the interviewer and the interviewee respectively.
Our thanks to Cambridge University for the interview recording.
- Automatic speaker separation and diarisation from multi-speaker audio files
- Blind or assisted initial speaker location with click-to-select from the analysed audio
- Easy-to-use presets for different settings like ‘very similar speakers’, ‘two speaker telephony’, ‘male and female speakers’, etc. Users can also save their own presets
- Ability to process audio at various bit and sampling rates: CD, telephony, DVD recordings, etc. (in mono wave files and certain other formats)
- Ability to tab-through original and processed audio seamlessly
- Quick review of sample segments by the user for assignment to speakers
- In-built exclude list to remove known unwanted noises like tones or hold music
- Universal background models pre-trained according to the customer’s requirements
- Playback and editing of audio files with a high-quality waveform viewer
- Selective addition of sections of audio that need to be included or excluded
- Verbose status update window with comprehensive information such as the processing time and the number of speaker seeds generated
- Simple click-to-go controls as well as an extensive (hidden) configuration window for power users
Short video example using CLEAVER’s human-assisted mode
CLEAVER uses a novel approach to the blind automatic segmentation of speakers, combining phonetic features such as pitch with statistical pattern recognition of short-term power spectrum features such as Mel Frequency Cepstral Coefficients (MFCCs). This approach requires minimal user intervention and allows for easy segmentation of the speech of separate speakers from multi-speaker recordings.
For example, consider a recording containing a conversation between two speakers, Speaker A and Speaker B, as well as some non-speech, which could be hold music or some other noise. The audio analysed by CLEAVER is separated into the speech of Speaker A, the speech of Speaker B and discarded non-speech. If over-talking is present in the recording, it can also be recovered. The problem tackled by CLEAVER is described in the schema below.
CLEAVER uses a two-tier recognition approach that needs little or no user interaction. It initially applies speaker clustering based on higher-level phonetic information: within an utterance, a continuous pitch track is a good indicator of speaker identity. After the audio has been split into segments that have continuous pitch tracks, a modelling algorithm based on short-term spectral features, agglomerative clustering and Gaussian mixture modelling is used to harvest the features belonging to individual speakers. The approach used in the software is detailed below.
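The first tier, splitting the audio wherever the pitch track breaks, can be illustrated with a minimal sketch. This is not CLEAVER's implementation: the function name, the semitone jump threshold and the minimum segment length are all hypothetical, and the pitch values are assumed to come from an external pitch tracker that reports one value per frame (None or 0 for unvoiced frames).

```python
import math
from typing import List, Optional, Tuple


def split_on_pitch_breaks(
    f0: List[Optional[float]],
    max_jump_semitones: float = 3.0,  # hypothetical threshold, not from CLEAVER
    min_frames: int = 5,              # hypothetical minimum segment length
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, whose pitch track is continuous.

    A track is treated as broken when a frame is unvoiced or when the pitch
    jumps by more than max_jump_semitones between consecutive voiced frames.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None   # first frame of the segment being built
    prev: Optional[float] = None  # pitch of the previous voiced frame

    for i, hz in enumerate(f0):
        voiced = hz is not None and hz > 0
        if voiced and prev is not None:
            jump = abs(12.0 * math.log2(hz / prev))
            continuous = jump <= max_jump_semitones
        else:
            continuous = False

        # Close the open segment when the track breaks.
        if start is not None and not continuous:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None

        # A voiced frame with no open segment starts a new one.
        if voiced and start is None:
            start = i
        prev = hz if voiced else None

    # Close a segment that runs to the end of the recording.
    if start is not None and len(f0) - start >= min_frames:
        segments.append((start, len(f0)))
    return segments
```

Each returned range would then be treated as one candidate cluster for the second, spectral tier. Measuring the jump in semitones rather than hertz is a design choice of this sketch, so that the same threshold behaves similarly for low and high voices.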
Broadly, the processing consists of the following steps:
- The file is loaded into CLEAVER and is split up into clusters. Each cluster, which is selected on the basis of a continuous pitch track, should contain only speech from one speaker or sounds from one particular noise source.
- All the clusters are extracted from the analysis file, and are grouped on the basis of their durations. The top 100 clusters are then compared against each other.
- From these 100 clusters, a number of ‘seed’ clusters that are sufficiently distinct from each other are extracted. How many seeds are extracted depends on the number of target speakers the user would like to obtain.
- These seeds are presented to the user, who listens to them and decides whether each one is to be included as a potential target speaker or discarded as trash.
- The user then merges the seeds that belong to each target speaker by dragging them on top of each other, and prepares the software for the ‘cleave’ process.
- The system grows these seeds into Speaker Models using clustering that selects the clusters that are most similar to the seed clusters for each target speaker.
- Once the models are created, the system goes through the whole recording, selects the clusters that are most similar to each speaker model, and assigns those clusters to the corresponding speakers.
- The user can now listen to, modify and save the clusters created for each target speaker.
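The seed-selection and assignment steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm CLEAVER ships: each cluster is reduced to a single feature vector (for instance a mean MFCC vector), a speaker model is a plain mean vector standing in for a Gaussian mixture model, "top 100" is assumed to mean the 100 longest clusters, and seeds are chosen by greedy farthest-point selection. All function names, the dictionary keys and the reject distance are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_seeds(clusters: List[dict], n_seeds: int, top_n: int = 100) -> List[int]:
    """Pick mutually distant seed clusters among the top_n longest clusters.

    Each cluster is a dict with "duration" (seconds) and "features" (Vector).
    Returns indices into clusters.
    """
    by_duration = sorted(range(len(clusters)),
                         key=lambda i: clusters[i]["duration"], reverse=True)
    candidates = by_duration[:top_n]
    if not candidates or n_seeds <= 0:
        return []
    seeds = [candidates[0]]  # start from the longest cluster
    while len(seeds) < min(n_seeds, len(candidates)):
        # Farthest-point step: take the candidate whose nearest seed is farthest away.
        best = max(
            (c for c in candidates if c not in seeds),
            key=lambda c: min(distance(clusters[c]["features"],
                                       clusters[s]["features"]) for s in seeds),
        )
        seeds.append(best)
    return seeds


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average the seeds the user merged into one target speaker."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign_clusters(clusters: List[dict], models: Dict[str, Vector],
                    reject_distance: float) -> Dict[str, List[int]]:
    """Assign every cluster to its nearest speaker model, or to "trash"."""
    assignment: Dict[str, List[int]] = {name: [] for name in models}
    assignment["trash"] = []
    for i, cluster in enumerate(clusters):
        name, d = min(((n, distance(cluster["features"], m))
                       for n, m in models.items()), key=lambda pair: pair[1])
        assignment[name if d <= reject_distance else "trash"].append(i)
    return assignment
```

In this sketch the user's review step sits between `pick_seeds` and `assign_clusters`: the seeds kept and merged for each target speaker are averaged with `mean_vector` to form that speaker's model, and everything farther than `reject_distance` from every model is treated as non-speech.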
CLEAVER is software for Windows.
An explanation of the algorithm behind it is given in our abstract and presentation at the International Association for Forensic Phonetics and Acoustics conference in Spain. You can read it here: Blind Speaker Clustering Using Phonetic and Spectral Features in Simulated and Realistic Police Interviews.