Perceptual Objective Listening Quality Analysis
Perceptual Objective Listening Quality Analysis (POLQA), also known as ITU-T Rec. P.863,[1] is an ITU-T standard that defines a model for predicting speech quality by analyzing digital speech signals.
Measurement scope
POLQA defines a model that predicts speech quality[2][3] by analyzing digital speech signals. The predictions of such objective measures should come as close as possible to the quality scores obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as the test stimulus for assessing telephony networks.
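How close an objective model comes to subjective scores is commonly summarized with a root-mean-square error and a correlation coefficient. The following is a minimal sketch of those two statistics, not the evaluation procedure used by ITU-T; the function names and the example scores are illustrative assumptions.

```python
import math


def rmse(subjective, predicted):
    """Root-mean-square error between subjective and predicted MOS values."""
    if len(subjective) != len(predicted) or not subjective:
        raise ValueError("need two non-empty score lists of equal length")
    squared = [(s - p) ** 2 for s, p in zip(subjective, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(subjective, predicted):
    """Pearson correlation coefficient between the two score lists."""
    if len(subjective) != len(predicted) or len(subjective) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_s = sum(subjective) / len(subjective)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((s - mean_s) * (p - mean_p)
              for s, p in zip(subjective, predicted))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_s == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_s * var_p)


# Made-up scores for four hypothetical test conditions (illustration only).
subjective_mos = [1.5, 2.5, 3.5, 4.5]
predicted_mos = [1.7, 2.4, 3.6, 4.3]

print(round(rmse(subjective_mos, predicted_mos), 3))
print(round(pearson(subjective_mos, predicted_mos), 3))
```

A low error together with a correlation close to 1 indicates that the predictions track the listening-test results closely.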
Technology capabilities
POLQA is the successor of PESQ (ITU-T Rec. P.862). POLQA avoids weaknesses of the P.862 model and extends it to handle higher-bandwidth audio signals. Further improvements target the handling of time-scaled signals and signals with many delay variations. Like P.862,[4] POLQA supports measurements in the common telephony band (300–3400 Hz), but it also has a second operational mode for assessing HD Voice, that is, wideband and super-wideband speech signals (50–14000 Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.
Development history
The POLQA activities started in ITU-T in early 2006 under the working title P.OLQA. In mid-2009 a competition was started to evaluate several candidate models. In May 2010 ITU-T selected the candidate models from three companies, OPTICOM, SwissQual (a Rohde & Schwarz company) and TNO (Netherlands Organisation for Applied Scientific Research), to form the future Recommendation P.863. The three companies were asked to merge their approaches into a single standardized model. The result is now standardized as POLQA / P.863.[1]
Genealogy of related standards
ITU-T’s family of full-reference objective voice quality measurements started in 1997 with P.861 (PSQM), which was superseded by P.862 (PESQ)[4] in 2001. P.862 was later complemented by the recommendations P.862.1[5] (mapping of PESQ scores to a MOS scale), P.862.2[6] (wideband measurements) and P.862.3[7] (application guide). P.863 (POLQA)[1] has been in force since 2011. Two additional implementer’s guides for P.863 were consented by ITU-T Study Group 12 in November 2011. In addition to the full-reference methods listed above, ITU-T’s objective voice quality measurement standards also include P.563[8] (a no-reference algorithm).
Testing typology
POLQA, like P.862 (PESQ), is a full-reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to the corresponding sample of the degraded signal (listener side), and scores the perceptual differences between the two signals. The psychoacoustic model is based on models of human perception similar to those used in MP3 or AAC. In essence, the signals are analyzed in the frequency domain (in critical bands) after masking functions have been applied. Unmasked differences between the two signal representations are counted as distortions. Finally, the accumulated distortions in the speech file are mapped onto a quality scale from 1 to 5, as is usual for MOS tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks (e.g. drive test tools for mobile network benchmarks).
Before the sample-by-sample analysis, POLQA temporally aligns corresponding excerpts of the reference and test signals. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or to characterize individual network components.
POLQA results principally model mean opinion scores (MOS) on a scale from 1 (bad) to 5 (excellent).
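To make the target scale concrete, the sketch below averages individual listener ratings into a MOS, which is the quantity POLQA tries to predict. The intermediate category labels follow the five-point listening quality scale of ITU-T P.800; the function names are illustrative assumptions and are not part of P.863.

```python
# Five-point absolute category rating scale (labels as in ITU-T P.800).
ACR_LABELS = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}


def mean_opinion_score(ratings):
    """Average a non-empty list of ratings, each an integer from 1 to 5."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in ACR_LABELS:
            raise ValueError(f"rating {rating!r} is not on the 1..5 scale")
    return sum(ratings) / len(ratings)


def nearest_label(mos):
    """Label of the category closest to a (possibly fractional) MOS value."""
    closest = min(ACR_LABELS, key=lambda category: abs(category - mos))
    return ACR_LABELS[closest]


# Hypothetical ratings from four listeners for one speech sample.
mos = mean_opinion_score([4, 4, 5, 3])
print(mos, nearest_label(mos))
```

Because a MOS is an average over many listeners, it is usually fractional even though each individual rating is an integer.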
Description of the POLQA Algorithm
The inputs to the algorithm are two waveforms, represented by two data vectors containing 16-bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, and the second contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator and a sample rate converter, which are used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation.
In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rates differ by more than approximately 1%, the signal with the higher sample rate is downsampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure of the quality of the delay estimation. The result of the resampling step that yielded the highest overall reliability is finally chosen. Once the correct delay has been determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale. A much more detailed and comprehensive description of the algorithm can be found in the Recommendation.[1] The next few sections are only intended to give an overview of the basics of POLQA’s internal structure.
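The first step can be illustrated with a deliberately simplified sketch. It substitutes a brute-force cross-correlation for POLQA's actual temporal alignment (which is considerably more elaborate), derives a relative sample rate from how the delay drifts over time, and applies the approximately 1% threshold mentioned above. All function names and parameters are hypothetical.

```python
def estimate_delay(reference, degraded, max_lag):
    """Lag (in samples) at which `degraded` best matches `reference`.

    Plain cross-correlation search; a stand-in for POLQA's own alignment.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += ref_sample * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def rate_ratio_from_delays(t_start, delay_start, t_end, delay_end):
    """Degraded/reference sample rate ratio implied by a drifting delay.

    A reference sample at time t appears at t + delay(t) in the degraded
    signal, so the slope of that mapping is 1 + (change in delay / time).
    """
    return 1.0 + (delay_end - delay_start) / (t_end - t_start)


def needs_resampling(rate_ratio, tolerance=0.01):
    """True if the sample rates differ by more than about 1%."""
    return abs(rate_ratio - 1.0) > tolerance


# Toy signals: the degraded signal is the reference delayed by 3 samples.
reference = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0]
degraded = [0, 0, 0, 0, 0, 1, 2, 3, 2, 1]
print(estimate_delay(reference, degraded, max_lag=5))

# Delay grows from 10 to 170 samples over 8000 reference samples.
ratio = rate_ratio_from_delays(0, 10, 8000, 170)
print(ratio, needs_resampling(ratio))
```

In this toy case the delay drift implies a 2% rate difference, so the signal with the higher sample rate would be downsampled before the comparison.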
The Core Model
The main element of the core model is the perceptual model, which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are the waveforms and the delay information. The output is the Disturbance Density, which is a measure of the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for frequency distortions, noise and reverberation distortions. A subsequent switch, which is triggered by a detector for very strong distortions, reduces the four Disturbance Density values to two, one for added and one for subtracted distortions. Up to this point the Disturbance Density is an indicator of the perceptibility of distortions only, and cognitive effects are not yet taken into account. Cognitive aspects are, however, important when human beings are asked to score the quality of what they perceive. Essentially, they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:
- Significant level variations
- Many frame repetitions
- Strong timbre
- Spectral flatness
- Noise switching during speech pauses
- Many delay variations
- Strong variations of the Disturbance Density over time
- Strong variations of the loudness of the signals
Two further indicators, one for spectral flatness and one for level variations, are also calculated in this step.
So far, all operations have been performed on frames with a duration of approximately 32 ms to 43 ms (depending on the sample rate, with an overlap of 50%) and for each Bark band separately. In a final step, all indicators are integrated over time and frequency in order to compute the final MOS-LQO value.
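The frame durations quoted above can be reproduced with simple arithmetic. The sketch below assumes frame lengths of 256, 512 and 2048 samples at 8, 16 and 48 kHz; these values are an assumption chosen to be consistent with the stated 32 ms to 43 ms range and should be checked against the Recommendation.[1] The helper names are hypothetical.

```python
# Assumed frame lengths in samples per sample rate in Hz (not from the text).
FRAME_LENGTHS = {8000: 256, 16000: 512, 48000: 2048}


def frame_duration_ms(sample_rate):
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * FRAME_LENGTHS[sample_rate] / sample_rate


def frame_starts(num_samples, frame_length):
    """Start indices of all complete frames, using 50% overlap."""
    hop = frame_length // 2  # 50% overlap: advance by half a frame
    return list(range(0, num_samples - frame_length + 1, hop))


for rate in sorted(FRAME_LENGTHS):
    print(rate, round(frame_duration_ms(rate), 1))

# 1024 samples at 8 kHz yield 7 overlapping frames of 256 samples each.
print(frame_starts(1024, FRAME_LENGTHS[8000]))
```

With 50% overlap every sample, apart from those at the very beginning and end of the signal, falls into two consecutive frames.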
The Perceptual Model
The key concept inside the perceptual model is idealization. The idea behind it is that POLQA is supposed to simulate Absolute Category Rating (ACR) tests. In an ACR test, however, subjects cannot compare against the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and use this as their own reference. Consequently, if they are asked to score a reference signal that is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step POLQA therefore corrects small imperfections of the reference signal in order to derive the same ideal reference for the comparison with the degraded signal as human subjects would use in their minds. Similarly to the idealization of the reference signal, some distortions present in the degraded signal that are hardly perceptible in an ACR test are partially compensated (e.g. small pitch shifts, linear frequency distortions).
The perceptual model starts by scaling the reference signal to an ideal average active speech level of approximately -26 dBov. No such scaling is performed on the degraded signal, because any deviation of the level of the degraded signal from the ideal -26 dBov is assumed to be a degradation that should be scored. Next, the spectra of both signals are computed using an FFT on 50% overlapping frames with a duration of between 32 ms and 43 ms (depending on the sample rate). Subsequently, small pitch shifts of the degraded signal are eliminated (frequency dewarping). The spectra are then transformed to a psychoacoustically motivated pitch scale by combining individual spectral lines (FFT bins) into so-called critical bands. The pitch scale used is similar to the Bark scale, with an average resolution of 0.3 Bark per band. The result is the Pitch Power Density.
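The level scaling of the reference signal can be sketched as follows. This is a simplification: it measures the plain RMS level of the whole signal, whereas POLQA refers to the active speech level, which excludes pauses. Samples are assumed to be floats normalized so that full scale is 1.0, and 0 dBov is taken as an RMS value equal to full scale. The function names are hypothetical.

```python
import math

TARGET_LEVEL_DBOV = -26.0


def level_dbov(samples, full_scale=1.0):
    """RMS level relative to overload (0 dBov means RMS equals full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("level of a silent signal is undefined")
    return 20.0 * math.log10(rms / full_scale)


def scale_to_level(samples, target_dbov=TARGET_LEVEL_DBOV, full_scale=1.0):
    """Apply one constant gain so that the RMS level equals the target."""
    gain_db = target_dbov - level_dbov(samples, full_scale)
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]


# Test tone: sine with amplitude 0.5 over ten full periods.
tone = [0.5 * math.sin(2.0 * math.pi * i / 100) for i in range(1000)]
print(round(level_dbov(tone), 2))

scaled = scale_to_level(tone)
print(round(level_dbov(scaled), 2))
```

Only the reference signal would be treated this way; as described above, the level of the degraded signal is left untouched so that level deviations remain visible to the model.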
Once the Pitch Power Density is available, the first three distortion indicators, for frequency response distortions, additive noise and room reverberation, are calculated. After this, the excitation of each band is derived. This includes the modeling of masking effects in the frequency domain as well as in the temporal domain. The result is, for each frame of each signal, a head-internal representation that indicates roughly how loud each frequency component would be perceived. A further idealization step of the reference signal then takes place by filtering out excessive timbre and low-level stationary noise. At the same time, linear frequency distortions and stationary noise are partially removed from the degraded signal. A subtraction of the idealized excitations finally leads to the Disturbance Density, which is a measure of the audibility of distortions.
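The final subtraction, together with the distinction between added and subtracted distortions made in the core model, can be illustrated with a toy example for a single frame. The sketch assumes that the per-band excitations are already available as lists of loudness-like values and omits everything else (masking, idealization, weighting). The function name and the numbers are hypothetical.

```python
def split_disturbance(reference_excitation, degraded_excitation):
    """Per-band difference of two excitations for one frame.

    Returns two lists: energy present only in the degraded signal
    ("added") and energy missing from it ("subtracted").
    """
    if len(reference_excitation) != len(degraded_excitation):
        raise ValueError("excitations must cover the same bands")
    added, subtracted = [], []
    for ref, deg in zip(reference_excitation, degraded_excitation):
        difference = deg - ref
        added.append(max(difference, 0.0))
        subtracted.append(max(-difference, 0.0))
    return added, subtracted


# Made-up excitations for three bands of one frame.
reference_bands = [1.0, 2.0, 3.0]
degraded_bands = [1.5, 2.0, 2.0]

added, subtracted = split_disturbance(reference_bands, degraded_bands)
print(added)
print(subtracted)
```

Keeping the two directions apart allows components that were added to the signal to be treated differently from components that were lost.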
POLQA in research
POLQA has been used to investigate the impact of tone languages and non-native listening on speech quality measurement.[9]
References
- http://www.itu.int/rec/T-REC-P.863/en ITU-T Recommendation P.863: Perceptual objective listening quality assessment
- http://www.aes.org/e-lib/browse.cfm?elib=16829 Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I—Temporal Alignment
- http://www.aes.org/e-lib/browse.cfm?elib=16830 Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II—Perceptual Model
- http://www.itu.int/rec/T-REC-P.862/en ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs
- http://www.itu.int/rec/T-REC-P.862.1/en ITU-T Recommendation P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO
- http://www.itu.int/rec/T-REC-P.862.2/en ITU-T Recommendation P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs
- http://www.itu.int/rec/T-REC-P.862.3/en ITU-T Recommendation P.862.3: Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2
- http://www.itu.int/rec/T-REC-P.563/en ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications
- D. Ebem (University of Nigeria); et al. (2011). "The Impact of Tone language and Non-Native Language Listening on Measuring Speech Quality" (PDF). Journal of the Audio Engineering Society. 59 (9, 2011 September): 9.