# WebRTC Video Quality Assessment


How can you make sure the quality of a WebRTC video call, or of video streaming, is good? You can collect every metric the statistics API exposes and still be nowhere closer to the answer. The reasons are simple. First, most of the statistics reported are about the network, not about video quality. Second, as anyone who has tried can confirm, while those metrics influence the perceived quality of the call, they do not correlate with it directly, which means you cannot guess or compute the video quality from those metrics alone. Finally, the quality of a call is a very subjective matter, and subjective judgments are difficult for computers to compute directly.

In a controlled environment, e.g. in the lab or while doing unit testing, people can use reference metrics for video quality assessment: you tag a frame with an ID on the sender side, capture the frames on the receiving side, match the IDs (to compensate for jitter, delay, or other network-induced problems), and measure some kind of difference between the two images. Google has “full stack tests” that do just that for many variations of codecs and network impairments, run as part of their unit test suite.
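In code, such a reference-based check might look like the following sketch (pure NumPy; function names are hypothetical): a frame ID stamped by the sender is used to line up each received frame with its reference, compensating for reordering and loss, before a per-frame PSNR is computed.

```python
import numpy as np

def psnr(ref, dist):
    """Peak Signal-to-Noise Ratio between two 8-bit frames."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)

def score_received_frames(reference_frames, received):
    """Match received frames to their reference by ID, then compare.

    `reference_frames` maps frame ID -> reference frame; `received` is a
    list of (frame_id, frame) pairs. Jitter and loss mean IDs may arrive
    out of order or be missing entirely, so we match by ID, not by order.
    """
    scores = {}
    for frame_id, frame in received:
        if frame_id in reference_frames:
            scores[frame_id] = psnr(reference_frames[frame_id], frame)
    return scores
```

Frames that never arrive simply produce no score, which is how loss shows up in such a harness.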

But how to do it in production and in real-time?

For most of the WebRTC PaaS use cases (Use Case A), the reference frame is not available (it would be illegal for the service provider to access the customer content in any way). Granted, the user of the service could record the stream on the sender side and on the receiving side, and compute a quality score offline. However, that would not allow acting on or reacting to sudden drops in quality; it would only help for post-mortem analysis. How can quality drops be detected and acted on in real time, without extra recording, upload, download, and so on?

Which WebRTC PaaS provides the best video quality in my case, or in some specific case? For most, this is a question that can’t be answered. How can I achieve this 4×4 comparison, or this Zoom-versus-WebRTC comparison, in real time, automatically, while instrumenting the network?

CoSMo R&D came up with a new AI-based video assessment tool that achieves exactly that, in conjunction with its KITE testing engine and the corresponding network instrumentation module. The rest of this blog post is unusually sciency, so reader beware.

INTRODUCTION

The first experiments with Real-Time Communication (RTC) over the Internet started in 1992 with CU-SeeMe, developed at Cornell University. With the launch of Skype in August 2003, RTC over the Internet quickly reached a large public. Then, from 2011, WebRTC technology made RTC directly available in web browsers and mobile applications.

According to the Cisco Visual Networking Index released in June 2017 [1], live video traffic (streaming, video conferencing) should grow dramatically from 3% of Internet video traffic in 2016 to 13% by 2021, which translates to 1.5 exabytes (1 exabyte = 1 million terabytes) per month in 2016, growing to 24 exabytes per month in 2021.

As for any application that deals with video, Quality of Experience (QoE) for the end user is important. Many tools and metrics have been developed to assess the QoE of video applications automatically. For example, Netflix developed the Video Multimethod Assessment Fusion (VMAF) metric [2] to measure the quality delivered by different video encoders and encoding settings. This metric helps to assess, routinely and objectively, the quality of thousands of videos encoded with dozens of encoding settings.

But it requires the original, non-distorted reference video to be available in order to compute the quality score of the same video distorted by compression. This method, well adapted to streaming of pre-recorded content where the original non-distorted video is available, cannot be applied to RTC, where the original video is usually not available.

One may propose to record the original video on the source side, before encoding and transfer to the remote peer(s), but then video quality assessment cannot be done in real time. In addition, recording live video during a real-time communication poses legal and security issues. For these reasons, the entity performing video quality assessment, for instance a third-party platform-as-a-service (PaaS), might not be authorized to store the media. In that case, analysing the video to assess its quality is still doable as long as it happens while the video is in memory; it must not be recorded and stored on disk, which prevents the use of any kind of reference when assessing quality.

Therefore, the special case of RTC cannot be solved by metrics requiring the reference video, and it is necessary to use metrics that are able to assess the quality of a video without access to a reference. Such metrics are known as No-Reference Video Quality Assessment (NR-VQA) metrics.

I. Video Quality Metrics

Video quality assessment techniques can be classified into three categories.
First, there are full-reference (FR) techniques, which require full access to the reference video. Among the FR methods we find the traditional approaches to video quality: Signal-to-Noise Ratio (SNR), Peak Signal-to-Noise Ratio (PSNR) [3], Mean Squared Error (MSE), Structural SIMilarity (SSIM) [4], Visual Information Fidelity (VIF) [5], VSNR [6] and the Video Quality Metric tools (VQM) [7].

These metrics are well known and easy to compute, but they are poor indicators of quality of experience, as shown by many experiments [8,9].
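To make the "easy to compute" point concrete, here is a simplified, single-window SSIM in NumPy. Note this is an illustration only: the standard SSIM index [4] averages this statistic over small local windows (e.g. 8×8 or 11×11) rather than computing it once globally.

```python
import numpy as np

def global_ssim(x, y, L=255):
    """Single-window SSIM over whole frames (illustrative simplification:
    the standard metric averages this over small local windows)."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # stabilizing constants
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()                  # luminance terms
    vx, vy = x.var(), y.var()                    # contrast terms
    cov = ((x - mx) * (y - my)).mean()           # structure term
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```

Identical frames score exactly 1.0; strongly dissimilar frames score near 0, which is the behaviour the broadcast experiment in section II relies on.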

Then there are the reduced-reference (RR) techniques, which need only a set of coarse features extracted from the reference video.

Finally, the no-reference (NR) techniques do not require any information about the reference video; they need no reference at all.

A comprehensive and detailed review of NR video quality metrics was published in 2014 [10]. A more recent survey of both audio and video quality assessment methods was published in 2017 [11]. The metrics are classified into two groups: pixel-based methods (NR-P), computed from statistics derived from pixel-based features, and bitstream methods (NR-B), computed from the coded bitstream.

II. Previous Efforts for WebRTC Video Quality Assessment.

A first initiative for evaluating the video quality of a broadcast to many viewers through WebRTC was proposed in [12]. For this experiment, the authors use the Structural SIMilarity (SSIM) index [4] as the measurement of video quality. The aim of the test is to measure how many viewers can join the broadcast while an acceptable image quality is maintained. The results are not conclusive at assessing the user experience precisely. As the number of viewers joining the broadcast increases, the SSIM measure remains surprisingly stable, with values in the interval [0.96, 0.97]. Then suddenly, when the number of clients reaches approximately 175, SSIM drops to values near 0. It is unlikely that the user experience remains acceptable, with no loss in quality at all, as the audience grows from 1 to 175 viewers. Besides, the test was performed using fake clients that implement only the parts of WebRTC responsible for negotiation and transport, not the WebRTC media processing pipeline, which is not realistic for assessing the video quality of a broadcast.

In [13], the authors evaluate various NR metrics on videos impaired by compression and transmission over lossy networks (0 to 10% packet loss). The eight NR metrics studied are complexity (number of objects or elements present in the frame), motion, blockiness (discontinuity between adjacent blocks), jerkiness (non-fluent and non-smooth presentation of frames), average blur, blur ratio, average noise, and noise ratio. As none of these NR metrics alone provides an accurate evaluation of the quality of such impaired videos, the authors propose using machine learning to combine several NR metrics with two network measurements (bit rate and packet loss rate) into an improved NR metric able to give video ratings comparable to those of the Video Quality Metric (VQM) [14], a reliable FR metric with good correlation to human perception. For this experiment, they used ten videos from the LIVE Video Quality Database, compressed at eight different levels using H.264 and impaired by transmission over a network with twelve packet loss rates.
They assessed the quality of their results against the scores given by the FR metric VQM, but not against other NR metrics.
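The combination step described above can be sketched as a regression from NR and network features to a reference quality score. The sketch below uses synthetic data, and an ordinary least-squares fit stands in for the paper's machine-learning model; the feature names and weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: rows are [blockiness, blur, motion, bit_rate, packet_loss]
# (feature choice illustrative); the target plays the role of a VQM-like
# reference score that the combined NR metric should reproduce.
X = rng.random((200, 5))
true_w = np.array([0.5, 0.8, -0.2, -0.6, 1.5])
y = X @ true_w + 0.05 * rng.standard_normal(200)

# Fit a linear model by least squares: the "combination" of several NR
# metrics and network measurements into a single predicted score.
Xb = np.hstack([X, np.ones((200, 1))])      # append intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def predict(features):
    """Predicted quality score for one feature vector."""
    return np.append(features, 1.0) @ w
```

In practice one would replace the least-squares fit with whatever regressor performs best, exactly as NARVAL does with neural networks and support vector regression in section III.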

In [15], the authors rely on many bitstream-based features to evaluate the impairments of the received video and how these impairments affect perceptual video quality.

The paper [16] presents a combination of audio and video metrics to assess audio-visual quality. The assessment has been performed on two different datasets.
First they present the results of the combination of FR metrics. The FR audio metrics chosen by the authors are the Perceptual Evaluation of Audio Quality (PEAQ) [17] and the Virtual Speech Quality Objective Listener (ViSQOL) [18]. As for the FR video metrics, they used the Video Quality Metric (VQM) [7], the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM) [4].
Then they present the results of the combination of NR metrics. The NR audio metrics are the Single Ended Speech Quality Assessment metric (SESQA) and the reduced SESQA (RSESQA) [19]. For the NR video metrics, they used a blockiness-blurriness metric [20], the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [21], the Blind Image Quality Index (BIQI) [22] and the Naturalness Image Quality Evaluator (NIQE) [23]. The best combination for both datasets is the blockiness-blurriness with RSESQA.

A recent experiment to estimate the quality of experience of WebRTC video streaming on mobile broadband networks was published in [24]. Different videos of various resolutions (from 720×480 to 1920×1080) were used as input for a WebRTC video call between a Chrome browser and a Kurento Media Server. The quality of the WebRTC videos was assessed subjectively by 28 people, each giving a score from 1 (bad quality) to 5 (excellent quality). The authors then used several metrics, all based on errors computed between the original video and the WebRTC video, to assess the quality objectively. Unfortunately, the authors do not clearly report whether there is a correlation between the subjective assessments and the objective measures.

III. NARVAL: Neural network-based Aggregation of no-Reference metrics for Video quAlity evaLuation.

III.1 Methodology

There are two main parts to this work: first, the extraction of features from videos representative of the video conferencing use case (as opposed to the pre-recorded content used by, e.g., Netflix); then, the training of a model to predict a score for a given video. We used six publicly available video quality datasets containing various distortions that may occur during a video communication to train and evaluate the performance of our model.

NARVAL TRAINING: Dense Deep Neural Network Graph

For the feature extraction part, we selected metrics and features published and evaluated on different image quality datasets. After computing them on the videos of our databases, we stored the data so we could reuse it in the training part. The data can then be processed to feed our training model, for example by taking the mean of a feature over the video. For the second part, we used different regression models, mainly neural networks with variations of the input and the layers, and also a support vector regressor.
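The temporal pooling mentioned above (collapsing per-frame features into one vector per video by taking the mean over frames) can be sketched as follows; the helper name is hypothetical.

```python
import numpy as np

def pool_video_features(per_frame_features):
    """Collapse per-frame feature vectors (shape: frames x features) into
    a single video-level vector by taking the temporal mean, which is
    then fed to the regression model."""
    return np.asarray(per_frame_features, dtype=np.float64).mean(axis=0)
```

Other pooling choices (median, percentiles, min/max) plug in the same way, which is one axis along which input variations can be explored.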

We tested multiple combinations of parameters for each model and only kept the best for each category of model. Convolutional, recurrent and time delay neural networks were used in addition to the most basic ones.

NARVAL TRAINING: 3D Convolutional Network Graph.

We trained our model on the databases using 5-fold cross-validation, repeating the training multiple times. As each database contains multiple distortions, we could not simply split the folds randomly; instead we chose the five folds so that every distortion is represented in each fold, and we kept the same distribution for all tests. Only the mean over the folds is then taken into account.

Another approach to building a fold is to put a video and all its distorted versions in the same fold. With this method the folds are much smaller, and the content of the validation fold is completely new to the model.
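A minimal sketch of that group-aware fold construction (helper name hypothetical): samples sharing a group ID, e.g. one source video and all its distorted versions, always land in the same fold, so the validation fold only contains content the model has never seen.

```python
from collections import defaultdict

def folds_by_group(groups, n_folds=5):
    """Assign sample indices to folds so that all samples sharing a
    group id (e.g. a source video and its distortions) stay together.

    `groups` lists one group id per sample; returns a list of n_folds
    index lists. Groups are spread round-robin over the folds.
    """
    unique = sorted(set(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique)}
    folds = defaultdict(list)
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return [folds[i] for i in range(n_folds)]
```

Holding out whole groups this way is what makes the validation score an honest estimate of performance on unseen content, at the price of smaller folds.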

III.2 Results

The results were first validated against a training set, i.e. a set with known scores, to check whether our computed video quality matched the known values, as illustrated below.

As a sanity check, we then compared the score produced by NARVAL against the SSIM and VMAF scores on the same reference video. We can see that, while not exactly equivalent, the scores exhibit the same behaviour. Funnily enough, this also illustrates a result well known in the image processing community but apparently counter-intuitive in the WebRTC community: the perceived video quality does not decrease linearly with the bitrate / bandwidth. You can see in the figure below that to reduce the quality by 10%, you need to reduce the bandwidth by a factor of 6 to 10!

Conclusion

Practically, it means that you can now use NARVAL to compute video quality in the absence of a reference frame or video! It opens the door to much simpler implementations in existing use cases, and to many new use cases where quality can be assessed at any given point of a streaming pipeline.

The full research report is available from CoSMo. CoSMo also provides licenses to two implementations: a Python implementation, better suited for research and prototyping, and a C/C++ implementation for speed and SDK embedding. Eventually, video quality assessment will be proposed as a service, not unlike the AQA service Citrix built on top of POLQA.

NARVAL has already been added to the KITE testing engine [25], to enable evaluation of the video quality of video services under all kinds of network and load conditions.

KITE is the only WebRTC testing solution that allows you to test desktop and mobile clients (whether browsers or native apps). Its network instrumentation module lets you programmatically control network features, separately client by client and server by server, to bring together all kinds of heterogeneous test beds. It allowed CoSMo to conduct the first comparative load testing of open source WebRTC servers [26]. If you are interested in having this capability in house, or in having us run tests for you, contact us.

Bibliography

[1] – Visual Networking Index, Cisco, 2017.
[2] – Toward A Practical Perceptual Video Quality Metric, Netflix, 2016.
[3] – Objective video quality measurement using a peak-signal-to-noise-ratio (PSNR) full reference technique, American National Standards Institute, Ad Hoc Group on Video Quality Metrics, 2001.
[4] – Image Quality Assessment: From Error Visibility to Structural Similarity, Wang et al., 2004.
[5] – Image information and visual quality, Sheikh et al., 2006.
[6] – VSNR: A Wavelet-Based Visual Signal-to-Noise Ratio for Natural Images, Chandler et al., 2007.
[7] – A new standardized method for objectively measuring video quality, Margaret H. Pinson and Stephen Wolf, 2004.
[8] – Mean Squared Error: Love It or Leave It? A new look at Signal Fidelity Measures, Zhou Wang and Alan Conrad Bovik, 2009.
[9] – Objective Video Quality Assessment Methods: A Classification, Review, and Performance Comparison, Shyamprasad Chikkerur et al., 2011.
[10] – No-reference image and video quality assessment: a classification and review of recent approaches, Muhammad Shahid et al., 2014.
[11] – Audio-Visual Multimedia Quality Assessment: A Comprehensive Survey, Zahid Akhtar and Tiago H. Falk, 2017.
[12] – WebRTC Testing: Challenges and Practical Solutions, B. Garcia et al., 2017.
[13] – Predictive no-reference assessment of video quality, Maria Torres Vega et al., 2017.
[14] – A new standardized method for objectively measuring video quality, Margaret H. Pinson and Stephen Wolf, 2004.
[15] – A No-Reference bitstream-based perceptual model for video quality estimation of videos affected by coding artifacts and packet losses, Katerina Pandremmenou et al., 2015.
[16] – Combining audio and video metrics to assess audio-visual quality, Helard A. Becerra Martinez and Mylene C. Q. Farias, 2018.
[17] – PEAQ — The ITU Standard for Objective Measurement of Perceived Audio Quality, Thilo Thiede et al., 2000.
[18] – ViSQOL: The Virtual Speech Quality Objective Listener, Andrew Hines et al., 2012.
[19] – The ITU-T Standard for Single-Ended Speech Quality Assessment, Ludovic Malfait et al., 2006.
[20] – No-reference perceptual quality assessment of JPEG compressed images, Zhou Wang et al., 2002.
[21] – Blind/Referenceless Image Spatial Quality Evaluator, Anish Mittal et al., 2011.
[22] – A Two-Step Framework for Constructing Blind Image Quality Indices, Anush Krishna Moorthy and Alan Conrad Bovik, 2010.
[23] – Making a “Completely Blind” Image Quality Analyzer, Anish Mittal et al., 2013.
[24] – Quality of Experience Estimation for WebRTC-based Video Streaming, Yevgeniya Sulema et al., 2018.
[25] – Real-time communication testing evolution with WebRTC 1.0, Alexandre Gouaillard and Ludovic Roux, 2017.
[26] – Comparative study of WebRTC Open Source SFUs for Video Conferencing, Emmanuel Andre et al., 2018.
