Abstract

Video adaptation is a key technology for universal video access in heterogeneous communication environments. The main challenge in this context is the selection of an optimal combination of Multi-Dimensional Adaptation (MDA) operations (such as spatial down-sampling, frame dropping and adjustment of quantization parameters) to maximize the user's Quality of Experience (QoE) under given resource constraints. To achieve this goal, the different factors that affect the perceptual quality need to be considered. This thesis addresses the optimization problem with a QoE-driven approach. To begin with, extensive subjective experiments are conducted to study the human preference between temporal and spatial details for different types of video content and a wide range of bit rates. A detailed analysis of the experimental data reveals how the perceived quality is influenced by the video content, the available transmission rate and the MDA operations. Moreover, the impacts of SNR, temporal resolution and spatial resolution on the perceptual video quality are modelled separately based on the observations from the subjective tests, and a multi-dimensional video quality metric (MDVQM) is proposed. Performance evaluations using subjective quality ratings show that the proposed video quality metric provides accurate quality estimation in the presence of different spatial and temporal quality impairments. Furthermore, accurate rate adaptation based on ρ-domain analysis is studied. The proposed rate control algorithm combines the ρ-domain rate model with header size estimation for H.264/AVC video. Experimental results show that the proposed algorithm achieves better rate control accuracy and video quality than the original ρ-domain rate control algorithm. Finally, the thesis concludes with a QoE-driven multi-dimensional video adaptation scheme that combines the proposed video quality metric and the rate control algorithm. The video quality metric is used to predict the resulting QoE under different adaptation modes. The optimal combination of adaptation operations is determined by considering both the resulting QoE and the computational complexity. A significant QoE improvement over conventional video adaptation schemes has been confirmed by performance evaluation using various types of video content.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my advisor, Prof. Dr.-Ing Eckehard Steinbach, who gave me the opportunity to be involved in this interesting research topic and whose guidance, discussions and suggestions were of great help during the work. I am also thankful to my second examiner, Prof. Dr.-Ing Alexander Raake, for reading my thesis and giving pertinent comments. At the same time, I would like to thank Prof. Dr.-Ing... for chairing my thesis defence. Moreover, I wish to thank Deutscher Akademischer Austausch Dienst for partially supporting my doctoral research. Furthermore, I thank the current and previous members of the Institute for Media Technology at TU München. They have made my life and work at our institute an enjoyable experience. I would like to give special thanks to Dr.-Ing Wei Tu, Dr.-Ing Yang Peng, Dr.-Ing Hu Chen, Jianshu Chao and Xiao Xu for the fruitful joint work and discussions. I would like to express my deepest gratitude to my wife Cong Li, who has been at my side, supporting me and giving me all the love I needed. Finally, I would like to thank my parents who, from the first day of my education, have provided tremendous support, constant encouragement, and great guidance.

List of Figures

3.7 Spatial Activity (SA) and Temporal Activity (TA) variation against bit-rate for typical test sequences
Actual DMOS vs. predicted DMOS from the SNR quality models
TCF vs. bpp without correction
Illustration of the bpp correction process
TCF vs. bpp with correction
Actual DMOS vs. predicted DMOS from the temporal quality metrics
SCF curves with/without correction
Actual DMOS vs. predicted DMOS for the spatial quality metrics
Linear relationship between the actual DMOS and the predicted DMOS for different STCF models
Relationship between the percentage of non-zero coefficients (1-ρ) and the number of generated bits (total bits and texture bits) for test sequences. (a) Foreman. (b) Mother&Daughter (M&D)
Performance of the MV-based rate model in Eq. (4.9)
Performance of the MVD-based rate model in Eq. (4.10)
Distribution of bits in the MB header
Relationship between the percentage of non-zero coefficients (1-ρ) and the size of the CBP information for the sequence Foreman (left: high bit-rate range, right: low bit-rate range)
Experimental results for Eq. (4.14)
Comparison of the frame size fluctuation of different rate control methods
Comparison of QP fluctuation within one frame
Workflow of the video adaptation algorithm
Example frames of test sequences
Video quality comparison between the QoE-driven adaptation scheme and three non-adaptive strategies (SNR-only, Spatial-only, Temporal-only) for the video STANDARD 30fps (transcoded from 4Mbps to 512kbps)
Video quality comparison between the QoE-driven adaptation scheme and three non-adaptive strategies (SNR-only, Spatial-only, Temporal-only) for the video STANDARD 60fps (transcoded from 8Mbps to 1Mbps)
Video quality comparison between the QoE-driven adaptation scheme and three non-adaptive strategies (SNR-only, Spatial-only, Temporal-only) for the video BBCNEWS 30fps (transcoded from 6Mbps to 512kbps)
Sample frames of the test videos adapted using SNR-only mode (left) and the QoE-driven adaptation scheme (right)

List of Tables

2.1 ITU Recommendations for objective video quality metrics
Bit-rates, frame rates and spatial resolutions of the processed video sequences for Test I
Bit-rates, frame rates and spatial resolutions of the processed video sequences for Test II
Number of subjects in the tests. The numbers in the bracket indicate the number of test subjects rejected by the screening process in each subtest as discussed in Section
Pearson correlation values of the SNR quality metrics
RMSE values of the SNR quality metrics
Outlier ratios of the SNR quality metrics
Cross validation result for the SNR metrics
Model parameters trained with all the subjective ratings
Pearson correlation values of the temporal quality metrics
RMSE values of the temporal quality metrics
Outlier ratios of the temporal quality metrics
Cross validation results for the temporal quality metrics
Pearson correlation values of the spatial quality metrics
RMSE values of the spatial quality metrics
Outlier ratios of the spatial quality metrics
Cross validation result for the spatial quality metrics
Pearson correlation values of the spatial-temporal quality metrics
RMSE values of the spatial-temporal quality metrics
Outlier ratios of the spatial-temporal quality metrics
Performance of video quality metrics using different STCF models (Eqs.( )) when both TR and SR are changed

4.1 Performance comparison of the two rate models for motion bits in Eq. (4.9) and Eq. (4.10)
Performance of the proposed rate model for CBP bits
Performance comparison of the rate control algorithms
Average deviation of the actual frame size from the target frame size
Average variance and maximum difference of QP values within a frame
Encoding bit-rates of the input and output video streams for performance evaluation
Video quality improvement of the QoE-driven adaptation scheme against non-adaptive strategies (for video STANDARD 30fps transcoded from 4Mbps)
Video quality improvement of the QoE-driven adaptation scheme against non-adaptive strategies (for video BBCNEWS 30fps transcoded from 6Mbps)
Video quality improvement of the QoE-driven adaptation scheme against non-adaptive strategies (for video STANDARD 60fps transcoded from 8Mbps)

Chapter 1 Introduction

1.1 Motivation

Over the past 20 years, internet video applications (such as video streaming, video telephony and video sharing) have gradually become an indispensable part of our daily lives. The fast development of mobile networks has inspired the idea of Universal Multimedia Access (UMA) [MSL99], which has further propelled the boom of video services on the internet. The aim of UMA is to allow users to access multimedia content at any time and from anywhere. This poses a major challenge for traditional video transmission systems due to the different display characteristics of the video content (such as frame rate and spatial resolution), the heterogeneity of the transmission channels, the time-varying nature of mobile networks, and the diversity of the users' end-devices.

Figure 1.1 shows a typical application scenario for video streaming. The video contents are encoded and stored on media servers, which are normally owned by content providers (such as film companies or news agencies). The media servers are connected to the core network. The core network might be a traditional best-effort IP network or a special Content Delivery Network (CDN). Between the core network and the end-users, there might be diverse access networks (often referred to as the last mile of the delivery path). The access networks can be classified by their access rates, from low-bitrate networks such as traditional dial-up networks over PSTN or 2.5G mobile networks (GPRS), to medium-bitrate networks such as low-speed xDSL or 3G mobile networks, to broadband services such as high-speed xDSL, 4G mobile networks, Wi-Fi/WiMAX or fiber networks. It is quite likely that the content server does not have any a priori knowledge about the device capacities or network conditions of the end users. So conventionally, the video content is often encoded with the goal of optimizing the rate-distortion performance. Therefore a
video stream needs to be adapted along the delivery path before it can be delivered to the end-user.

Figure 1.1: Media delivery over heterogeneous access networks (adapted from [Lab])

One way to manage the heterogeneity is simulcast [CAL96, MJ96], in which the same video content is encoded at several different bit-rates or even using different coding standards. Different versions of the video content are then delivered to the end users according to their specific characteristics. This strategy might work well for wired networks, where the transmission capacity of the users is fixed and relatively stable. For end-users connected over mobile networks, however, the available transmission rate is time-varying and hard to predict, so the chosen bitstream may not match the user's transmission characteristics very accurately. Also, storing multiple versions of the same content consumes more storage on the content server, which is costly for content providers.

An alternative solution to simulcast is to apply video adaptation at the edge of the access network. In this solution, the video stream received by the end-user is no longer directly transmitted from the content server, but is generated at an intermediate network node by
adapting the original video stream from the server to match the user's network characteristics and device capacities. This could be done, for example, on a proxy which is built on top of the gateway nodes (base stations, access points, routers etc.) in Figure 1.1. If the video stream is delivered through a CDN, there are normally also special proxy servers deployed at the edge of the core network, which can be used to perform the video adaptation tasks. In this solution, only one high-quality version of the video content needs to be stored on the content server. When a user requests a video, the stored video stream is first delivered to the proxy, which then performs the video adaptation in real time to meet the user's requirements. Compared with the simulcast solution, performing the video adaptation at the edge of the core network saves valuable storage space on the content server. Furthermore, since only one version needs to be transmitted through the core network and the video stream can be cached on the proxies, this solution also reduces the traffic in the core network.

Video adaptation can be performed by adjusting different parameters of the encoded stream, such as the quantization step-size, the frame rate and the spatial resolution of the video content. Multi-Dimensional Adaptation (MDA) refers to schemes in which the impacts of all these factors are considered jointly to meet the resource constraints and optimize the video quality. Joint optimization among different dimensions offers more opportunities for quality optimization but also raises several new challenges, which include the assessment of video quality under different spatial/temporal resolutions and the selection of the optimal combination of adaptation operations [Wan05]. These issues in MDA can be solved by a Quality-of-Experience (QoE) driven approach.
Since the target users of most video delivery systems are human beings, the most reasonable way to assess quality is to collect the users' opinions on the delivered video streams. The users' satisfaction level is often referred to as their Quality-of-Experience. The most accurate way to measure QoE is by conducting subjective tests. However, subjective tests are usually costly and time-consuming, so this approach is not practical for the evaluation of video quality in real time. Due to the limitations of subjective quality assessment and the increasing demand for in-service quality assessment, there has been intensive study of perceptual Video Quality Metrics (VQM), which aim to estimate the QoE of a video processing system by taking into account the characteristics of human visual perception. QoE-driven MDA schemes utilize perceptual VQMs to assess the video quality and make optimal adaptation decisions on the fly when needed.

The general diagram of a QoE-driven multi-dimensional video adaptor is shown in Figure 1.2. In Figure 1.2, the resource allocator collects the feedback information (such as network conditions and the user's preference) from the channel and the end-users. Based on the collected information, it determines the target source coding bit-rate for the adaptation operation
(BR) and passes this information to the mode selector. Meanwhile, the incoming video stream is decoded by the video decoder and the decoded video frames are used by the mode selector to extract the necessary video features. The perceived quality of the adapted videos under different adaptation modes can be estimated by feeding all this information to a perceptual video quality metric. Taking into account various factors (such as the computational complexity and resulting QoE of different adaptation operations), the mode selector determines the optimal parameters (e.g., spatial resolution SR and frame rate TR) for the video adaptation operations. According to the decisions of the mode selector, the encoder performs the proper adaptation operations. The rate control module interacts with the encoder to guarantee that the adapted video meets the rate requirements given by the resource allocator.

Figure 1.2: Block diagram of a QoE-driven multi-dimensional video adaptation system

In this thesis, various aspects of such a QoE-driven MDA system are studied. More specifically, the focus is put on the estimation of the perceived video quality (quality metric), the selection of the optimal adaptation mode (mode selector) and the accurate control of the bitrate (rate controller). The corresponding modules are marked in grey in Figure 1.2.

1.2 Summary of Main Contributions

The main contributions of this dissertation can be summarized as follows:

• The individual and overall impact of different video properties (such as the quantization
step-size, the spatial resolution and the frame rate) on the perceived video quality is studied through specifically designed subjective tests. Prior work on video quality assessment mainly focuses on videos with a fixed frame rate. In this work, subjective tests are conducted which help us to understand how the perceived video quality is affected when the spatial and temporal resolution of the video content are changed separately or jointly.

• An accurate no-reference VQM for evaluating the perceived video quality at different frame rates and spatial resolutions is presented and its prediction performance is analyzed. The proposed metric models the overall video quality as the product of separate terms, which capture the impact of quantization, frame dropping and spatial down-sampling, respectively. All the features used in the VQM can be easily computed from the encoded bitstream, so that it is well suited for in-service video quality estimation. The performance of the proposed metric is validated by the results of the subjective tests.

• An accurate rate control algorithm based on ρ-domain analysis is proposed. The approach uses a two-stage encoder structure to resolve the inter-dependency between RDO and ρ-domain rate control. The size of the header information is estimated using an improved rate model which considers the different components of a macroblock header. Experimental results show that the proposed algorithm achieves better rate control accuracy and video quality than the original ρ-domain rate control algorithm. The proposed rate control algorithm can be used together with the video quality metric to perform accurate bitrate adaptation for QoE optimization.

• Based on the proposed VQM and rate control algorithm, a QoE-driven MDA scheme is developed for optimizing the perceived video quality.
The adaptation scheme uses the proposed VQM to estimate the resulting video quality under different adaptation modes and then determines the optimal adaptation mode by taking into account the video quality as well as the computational complexity. The algorithm is evaluated and shown to provide better performance than conventional adaptation schemes.

1.3 Outline of the Thesis

The rest of the thesis is organized as follows. Chapter 2 outlines the main aspects of video quality assessment and gives a review of the state-of-the-art video quality metrics. In Chapter 3, a multi-dimensional video quality metric is proposed and evaluated with results from extensive subjective tests. Then, an improved ρ-domain rate control algorithm for H.264/AVC video with header size estimation is presented in Chapter 4. The proposed QoE-aware video
adaptation scheme is presented in Chapter 5 together with its performance evaluation. The thesis concludes in Chapter 6 with a summary of the results.

Parts of this thesis have been published in [ZS11, SZPD12].
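The mode-selection step described in Section 1.1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Mode` fields, the candidate list, the complexity costs and the weight `lam` are hypothetical, and `predicted_qoe` is only a placeholder that mirrors the product form mentioned in Section 1.2, not the actual metric developed in Chapter 3.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Mode:
    """One candidate adaptation mode (hypothetical structure)."""
    frame_rate: float   # frames per second after frame dropping
    width: int          # spatial resolution after down-sampling
    height: int
    complexity: float   # relative computational cost, arbitrary units


QualityFn = Callable[[Mode, float, Mode], float]


def predicted_qoe(mode: Mode, target_bitrate: float, source: Mode) -> float:
    """Placeholder quality estimate: product of three invented factors.

    The factor shapes are NOT taken from the thesis; they only show how a
    product-form metric could be plugged into the selector.
    """
    bpp = target_bitrate / (mode.frame_rate * mode.width * mode.height)
    snr_term = bpp / (bpp + 0.05)  # saturates as bits per pixel grow
    temporal_term = (mode.frame_rate / source.frame_rate) ** 0.5
    spatial_term = ((mode.width * mode.height)
                    / (source.width * source.height)) ** 0.5
    return snr_term * temporal_term * spatial_term


def select_mode(candidates: Sequence[Mode], target_bitrate: float,
                source: Mode, quality: QualityFn = predicted_qoe,
                lam: float = 0.01) -> Mode:
    """Return the candidate with the best quality-minus-complexity score."""
    if not candidates:
        raise ValueError("at least one candidate mode is required")
    return max(candidates,
               key=lambda m: quality(m, target_bitrate, source)
               - lam * m.complexity)


if __name__ == "__main__":
    source = Mode(30.0, 1280, 720, 3.0)
    candidates = [source,                        # SNR-only adaptation
                  Mode(15.0, 1280, 720, 2.0),    # halve the frame rate
                  Mode(30.0, 640, 360, 1.0)]     # halve each dimension
    print(select_mode(candidates, 512_000, source))
```

The quality function is passed in as a parameter so that a real metric can replace the placeholder without touching the selection logic; the linear complexity penalty is just one simple way to trade quality against computational cost.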

Chapter 2 Overview of Video Quality Assessment

Nowadays, video data is responsible for a considerable part of the total internet traffic due to the boom of various video-related services and the remarkable evolution of network technologies and mobile devices. Video quality assessment is fundamental to monitoring and guaranteeing the quality of these video services. Depending on whether human observers are involved in the assessment process, video quality assessment can be performed subjectively or objectively. The purpose of subjective video quality assessment is two-fold. First, it can be used to evaluate or compare the performance of different video processing algorithms/systems. Second, it can help us to find out how the perceived video quality is affected under different conditions. Objective video quality assessment, on the other hand, estimates the video quality using video features which can be measured and computed objectively, which makes it possible to monitor and optimize the video quality automatically. Both are indispensable in designing and evaluating a video system which aims to provide the best QoE to its users.

In this chapter, background and related work in the field of both subjective and objective video quality assessment are discussed. Section 2.1 provides a summary of the guidelines given in the ITU standard documents for conducting subjective tests. This is followed by a review of the development of objective video quality metrics in Section 2.2.

2.1 Subjective Video Quality Assessment

In subjective video quality assessment, a set of test video sequences is presented to human observers (also referred to as test subjects). The task of the human observers is to provide their opinions about the video quality. There are two basic forms of subjective tests: the paired comparison approach and the Mean Opinion Score (MOS) approach. In paired comparison, two test videos are displayed side by side and the human observers need to judge
which one has a better quality. In the MOS approach, the videos are displayed one by one and the human observers are asked to rate the quality of each video. The MOS value is calculated as the mean value of the collected ratings.

The judgement of human beings tends to be affected by many factors, such as their state of health, their mood and the surrounding environment. Therefore, to ensure the accuracy of the results, subjective tests must be conducted in a controlled manner. For this purpose, the International Telecommunication Union (ITU) has established a series of recommendations to standardize the design and procedure of subjective tests. The most important documents are ITU-R Rec. BT.500 [ITU99] (for television applications), ITU-T Rec. P.910 [ITU98] and ITU-R Rec. BT.1788 [ITU07] (for multimedia applications). The most important aspects defined in these documents regarding the preparation and conduct of subjective tests are summarized in the following.

Test Method

When designing a subjective test, the first question one needs to answer is the purpose of the test. The standard documents recommend different test methods addressing various test scenarios. The test method should be carefully selected depending on the specific goal of the test. The following is a brief description of the most widely used test methods. Since a comparison of system performance is not the focus of this thesis, only test methods following the MOS approach are discussed. The reader can refer to [ITU98, ITU99, ITU07] for more details.

• Double-Stimulus Continuous Quality Scale (DSCQS): DSCQS is a Double Stimulus (DS) method defined in [ITU99], in which two videos, i.e. the original source sequence (also referred to as the reference sequence) and a processed version of the same sequence, are presented twice to the test subjects.
The presentation order of the two sequences is randomized, i.e., sometimes the reference sequence is presented first and sometimes the processed sequence is presented first. The test subjects are asked to give their ratings at the second presentation of each video. This voting procedure is shown in Figure 2.1a. The test subjects use a continuous grading scale as shown in Figure 2.1b for the rating. As pointed out in [ITU99], DSCQS is more resilient to contextual effects than other test methods. Contextual effects refer to the phenomenon that the results of subjective tests tend to be affected by the level and ordering of the impairments that appear in the tests. For example, if an impaired test sequence is presented after several high-quality test sequences, the viewers may give it a lower score than it would normally deserve. DSCQS is resilient to such effects because the original source sequence is always available to serve as a reference when the test subjects rate the processed sequences. However, the use of a reference for each test sequence also
causes DSCQS to be very time-consuming, so only a small number of test sequences can be evaluated during a session, which is the major disadvantage of DSCQS.

Figure 2.1: Double-Stimulus Continuous Quality Scale (DSCQS) [ITU99]: (a) presentation structure; (b) rating scale.

• Absolute Category Rating (ACR): This method is a Single Stimulus (SS) method defined in [ITU98]. In the ACR method, the test sequences are presented one at a time and the test subjects are asked to rate each sequence after its presentation. The procedure of ACR is shown in Figure 2.2a. In order to alleviate the impact of contextual effects, the presentation order of the test sequences should be randomized for each individual test subject. Typically, ACR uses a five-level categorical grading scale as shown in Figure 2.2b. A nine-level scale can also be used if a higher discriminative power is desired, as suggested in [ITU98]. Because each sequence is presented only once before being rated, ACR allows more test sequences to be evaluated during the same time interval than DSCQS. The drawback of ACR is that, as an SS method, it may be seriously affected by contextual effects and therefore needs more participants to achieve the same reliability as DSCQS [ITU05a]. The efficiency of ACR is partially offset by this drawback. For this reason, VQEG has used an enhanced version of ACR in its Multimedia Test [VQE07]. In the improved method, the original version of each video content is randomly inserted into
the test dataset to serve as a hidden reference. Therefore, this improved ACR method is also referred to as ACR-HR (ACR with Hidden Reference). In [HTG05], ACR-HR and DSCQS are compared for low-resolution videos. The results show that ACR-HR can provide the same reliability as DSCQS while keeping the simplicity and efficiency of ACR.

Figure 2.2: Absolute Category Rating (ACR) [ITU99]: (a) presentation structure; (b) rating scale.

• Subjective Assessment Methodology for Video Quality (SAMVIQ): SAMVIQ is a new assessment methodology defined in [ITU07]. In SAMVIQ, the evaluation is conducted scene by scene. Each scene contains all the processed test sequences of the same video content. To alleviate contextual effects, an explicit reference and a hidden reference of the same content are also included in each scene. The hidden reference is inserted randomly among the processed test sequences. The major difference between SAMVIQ and conventional test methods (such as DSCQS and ACR) is that the test subjects can control the order of the presentation as well as start or stop the presentation of a test sequence at any time. There is no strict timing for the rating of each test sequence. The test subjects can freely make comparisons between a processed test sequence and the reference sequence, or between two processed test sequences, and then give or adjust their ratings for individual test sequences accordingly. This allows SAMVIQ to produce reliable subjective ratings. Figure 2.3 shows the presentation structure and rating scale for SAMVIQ.
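Whichever of the above methods is used, the individual ratings are eventually reduced to one MOS per test sequence, as noted at the beginning of this section. The following minimal Python sketch computes the MOS together with a 95% confidence interval. The function name and the sample ratings are hypothetical, and the factor 1.96 assumes that the ratings are approximately normally distributed.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of the confidence interval).

    The MOS is the arithmetic mean of the ratings. The half-width is
    z * s / sqrt(N), where s is the sample standard deviation (N - 1
    in the denominator) and N is the number of ratings.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two ratings are needed for an interval")
    half_width = z * stdev(ratings) / math.sqrt(n)
    return mean(ratings), half_width


if __name__ == "__main__":
    # Hypothetical ratings of one sequence on a five-level ACR scale.
    ratings = [4, 5, 3, 4, 4]
    mos, half_width = mos_with_ci(ratings)
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

A narrow interval indicates that the subjects agree on the rating; methods that are more exposed to contextual effects typically need more subjects to reach the same interval width.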

In [BHTHB06], the performance of SAMVIQ and ACR-HR is compared using test sequences with CIF (352x288) resolution. The results suggest that the subjective ratings produced by both methods are very similar. Considering the higher efficiency of ACR-HR (in SAMVIQ, test subjects tend to spend more time making comparisons between different sequences), ACR-HR is considered to be the preferred method there. In contrast, the impact of spatial resolution on the accuracy of SAMVIQ and ACR-HR is studied in [PP08]. The results show that for video contents of high resolution (VGA/HD), the results from SAMVIQ are more precise than those from ACR-HR for the same number of test subjects. Based on the above discussion of the different test methodologies, and considering the number and spatial resolution of the test sequences, SAMVIQ is selected as the test method for collecting subjective ratings in the work presented in Chapter 3.

Test Material

Since the purpose of video quality assessment is to evaluate the performance or help optimize the QoE of a certain video processing system, the target application of the system under consideration should be taken into account when selecting the test materials. Also, to improve the reliability of the test results, it is important that a wide variety of materials is used in the test. The variety of the test materials refers not only to the diversity of the video contents but also to the quality range of the processed sequences.

In the subjective tests conducted by VQEG [VQE00, VQE03, VQE08, VQE09], the Spatial perceptual Information (SI) and Temporal perceptual Information (TI) are used to determine the characteristics of the video contents. The two parameters are defined as:

\[ SI = \max_{time} \{ \mathrm{std}_{space} [ \mathrm{Sobel}(F_n) ] \} \tag{2.1} \]

\[ TI = \max_{time} \{ \mathrm{std}_{space} [ F_n - F_{n-1} ] \} \tag{2.2} \]

where $F_n$ denotes the video frame at time $n$ and $\mathrm{Sobel}(F_n)$ is the frame filtered by the Sobel filter.
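The Sobel filter is widely used in image processing algorithms to compute an approximation of the gradient magnitude at each point in the input image. The 2D Sobel filter uses a pair of 3x3 convolution kernels given in Eq. (2.3):

\[ K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \quad \text{and} \quad K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{2.3} \]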

and the filtered frame of the Sobel filter is calculated as:

\[ G_x = K_x * A, \quad G_y = K_y * A, \quad G = \sqrt{G_x^2 + G_y^2} \tag{2.4} \]

where $A$ is the input image and $G$ is the filtered image. The operator $*$ denotes the 2D convolution operation.

The selected video contents should span the full range of scene characteristics which are of interest to the system under test. To achieve precise and reliable quality ratings, the quality range of the test materials should be as large as possible. If the quality range is too narrow, the test subjects tend to give quality scores which exaggerate the quality difference between two test sequences. In most cases, it is good practice to include processed sequences with extremely high and extremely low quality in the test material.

Test Subjects

When selecting the test subjects, the number and type of the viewers should be carefully considered. In most standard documents [ITU98, ITU99, ITU07], it is suggested that the number of test subjects should not be less than 15 in order to produce reliable results. In practice, the appropriate number of participants should be selected according to the reliability of the test method as well as the expected precision of the results. For example, VQEG recommends using at least 24 test subjects for its Multimedia Test [VQE07] with the ACR-HR method, while the European Broadcasting Union (EBU) suggests using at least 15 test subjects in its video codec evaluations [KSW05] with SAMVIQ.

Two types of test subjects should be distinguished, i.e. experts and non-experts. The term non-expert refers to people who are not directly concerned with picture quality as part of their normal work and are not experienced assessors (quoted from [ITU07]). All the standard documents suggest that the test subjects in the subjective tests should be non-experts.
The reasoning is that experts tend to have a fixed or preconceived way of evaluating image/video quality which differs from that of non-experts. Since non-experts make up a much larger part of the public who consume the video contents, the results from non-experts are more representative and reliable. That does not mean, however, that the test subjects should have no background knowledge about the tests. They need to understand the types of artifacts and the quality range that are expected in the tests. This can be achieved through a training session before the formal test session.

Test Procedure

In general, a subjective test can be divided into five phases: preparation, introduction, training session, test session and post-processing.

In the preparation phase, the test environment (including the room used for the test as well as the display devices) should be set up according to the guidelines provided in [ITU99]. The visual acuity of the test subjects should also be checked. Before the test starts, both a written and an oral introduction should be given to the test subjects. The introduction should cover the timing and organization of the test, how the test sequences will be presented, how the test subjects should rate the sequences and, sometimes, the types of impairment expected to occur in the tests. After that, a training session should be provided to help the test subjects become familiar with the test interface, the voting process and the types of video contents and visual artifacts in the test. The procedure and video materials used in the training session should be similar to those of the formal test session, but the same video contents should not be included in the test session again. The ratings collected from the training session should not be considered in the final results. Any questions from the test subjects about the test can be answered at this point. After the formal test session begins, no further questions are allowed.

After the test sessions, the collected subjective data should be screened and outliers should be removed. Different screening processes are defined in the standard documents for the different test methodologies. In Section 3.3.6, the screening process defined for SAMVIQ is discussed in more detail. For more information about the different screening processes, the reader can refer to the corresponding ITU recommendations [ITU98, ITU99, ITU07].

The standardization efforts discussed above have made subjective tests the most reliable way to assess video quality.
Although subjective quality assessment is not suitable for real-time applications, it is still very important in the sense that it provides the ground-truth data for the design and verification of objective video quality metrics, which enable real-time in-service video quality evaluation. Also, the information from the subjective tests can help us to understand the properties and limits of the human visual system.

2.2 Objective Video Quality Assessment

Objective video quality assessment can be used instead of subjective quality assessment whenever the involvement of human beings needs to be avoided. It is useful in a number of scenarios throughout all phases of building a video processing and communication system, such as:

• Estimation of the resources necessary to deliver a certain quality level at the planning stage of a network service.
• Comparison of different processing algorithms when designing the system.
• Verification of system performance during the testing phase.
• Monitoring and optimization of the perceived video quality when the system is running.

The basic idea of objective video quality assessment is to use mathematical quality metrics to estimate the perceptual video quality in an automatic and objective manner. The most widely used objective video quality metric nowadays is perhaps the Peak Signal-to-Noise Ratio (PSNR), which can be calculated as follows:

\[ \mathrm{MSE} = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left[ I_1(i,j) - I_2(i,j) \right]^2 \tag{2.5} \]

and

\[ \mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \tag{2.6} \]

where $W$ and $H$ denote the width and height of the picture, and $I_1$ and $I_2$ are the corresponding frames in the original and processed video, respectively. The peak value of 255 corresponds to 8-bit samples. Since the human eye is more sensitive to details in the luminance component of an image or video than to those in the chrominance components, the PSNR value is normally calculated for the luminance component only. The PSNR value for a video sequence is calculated as the average PSNR value over all frames in the sequence.

The popularity of PSNR is largely due to its simplicity and clear physical meaning. It does provide a good estimate of the perceived video quality as long as the video content and the type of distortion do not change [EF95, HTG08]. However, for more complicated cases where different video contents, frame rates and spatial resolutions are to be considered, the performance of PSNR is not satisfactory [Gir93, EF95, Win99, WB02]. The major drawback of PSNR is that it measures only the fidelity of the signal without considering the characteristics of the video content, the Human Visual System (HVS), or the interaction between the two. In this sense, it makes no distinction between a video signal and an audio/speech signal or a signal of any other type.
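The PSNR computation of Eqs. (2.5) and (2.6) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes that each frame is given as a list of rows of 8-bit luminance samples, and the function names `mse`, `psnr` and `sequence_psnr` are chosen here for illustration only.

```python
import math


def mse(ref, dist):
    """Mean squared error between two equally sized frames, as in Eq. (2.5).

    Each frame is a list of rows of luminance samples (a layout assumed
    for this sketch).
    """
    height, width = len(ref), len(ref[0])
    if len(dist) != height or any(len(row) != width for row in dist):
        raise ValueError("frames must have identical dimensions")
    squared_error = sum(
        (a - b) ** 2
        for ref_row, dist_row in zip(ref, dist)
        for a, b in zip(ref_row, dist_row)
    )
    return squared_error / (width * height)


def psnr(ref, dist, peak=255.0):
    """PSNR in dB, as in Eq. (2.6); peak=255 assumes 8-bit samples."""
    error = mse(ref, dist)
    if error == 0:
        return math.inf  # identical frames: the ratio is unbounded
    return 10.0 * math.log10(peak ** 2 / error)


def sequence_psnr(ref_frames, dist_frames):
    """Average of the per-frame PSNR values over a whole sequence."""
    values = [psnr(r, d) for r, d in zip(ref_frames, dist_frames)]
    return sum(values) / len(values)
```

Note that the code never looks at what the frames depict: two very different distortions with the same squared error receive the same score, which is exactly the limitation of pure fidelity measures discussed in the text.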
In [WM08], this kind of pure fidelity measurement is called a data metric to differentiate it from perceptual metrics, in which psychophysical aspects are considered.

Classification of Objective Video Quality Metrics

If the HVS is considered as a processing system, then the video content is its input and the perceived video quality is its output. One straightforward way of predicting the system output is to identify the internal components of the system and to model the behavior of its fundamental functional blocks. This is the basic idea behind the so-called HVS-based approach [WM08]. Over the years, several well-known HVS-based VQMs have been proposed, such as the Visible Differences Predictor (VDP) by Daly [Dal93], the Sarnoff model proposed by Lubin [Lub97], the Perceptual Distortion Metric (PDM) proposed by Winkler [Win98, Win99], and the Digital Video Quality (DVQ) metric proposed by Watson [WHM01]. In [WSB03a], a general framework of these HVS-based VQMs is summarized as shown in Figure 2.4. The framework shows that several of the most important perceptual features of the HVS (such as light adaptation, contrast sensitivity, masking and facilitation, and error pooling) are considered and integrated to imitate the human visual perception process. The biggest challenge for VQMs of this category is that human visual perception is a very complex process which involves not only the signal reception in the eyes but also the processing of the resulting signals in the brain. Although our understanding of the whole system is much better than a decade ago, there is still a long way to go until visual perception can be modeled accurately enough. Also, the computational complexity of the HVS-based approach has limited the scope of its possible applications. Since the target of the work in this dissertation is to build a video quality metric for real-time video adaptation, HVS-based metrics are not our focus. Interested readers can refer to the overviews in the literature for more details on HVS modeling [WSB03a, UE07, WM08, CSRK11].

Figure 2.4: General framework of HVS-based visual quality metrics (adapted from [WSB03a]).

The second way of modeling the system is to treat it as a black box. The system response can then be approximated by observing the relationship between the input and output signals. In [WM08], this is referred to as the Engineering Approach. In this way, complicated modeling of the building blocks of the HVS can be avoided and the problem of predicting the output signal can be solved by numerical approaches. Although the accuracy and universality of the engineering approach are not as good as those of the HVS-based approach, it is more suitable for real-time applications. Therefore, the engineering approach is adopted in Chapter 3 for developing the video quality metric.
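As a toy illustration of the engineering approach, the sketch below treats the viewer as a black box and fits a straight line that maps a single objective feature to subjective ratings by ordinary least squares. The choice of a linear model, the function names `fit_line` and `predict`, and the sample values in the usage note are all assumptions made for this illustration; they are not taken from the metrics reviewed in this chapter.

```python
def fit_line(features, scores):
    """Ordinary least-squares fit of scores ~ a * feature + b.

    features: objective measurements (e.g. one value per test sequence)
    scores:   the corresponding subjective ratings
    Returns the pair (a, b).
    """
    n = len(features)
    if n < 2 or len(scores) != n:
        raise ValueError("need at least two (feature, score) pairs")
    mean_f = sum(features) / n
    mean_s = sum(scores) / n
    sxx = sum((f - mean_f) ** 2 for f in features)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    sxy = sum((f - mean_f) * (s - mean_s) for f, s in zip(features, scores))
    a = sxy / sxx
    b = mean_s - a * mean_f
    return a, b


def predict(model, feature):
    """Predict a rating for a new feature value with the fitted line."""
    a, b = model
    return a * feature + b
```

For example, with made-up pairs such as `fit_line([30, 35, 40], [2, 3, 4])`, the fitted line reproduces the three ratings exactly. In practice the mapping is usually nonlinear and the model has to be validated on data that were not used for fitting.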
Another traditional classification of video quality metrics is based on the amount of reference information available for quality estimation [ITU00]. If the metric requires access to the whole reference video sequence for the quality estimation of a distorted video (as shown in Figure 2.5), it is classified as a Full-Reference (FR) video quality metric. When humans judge the quality of an image/video, it is always helpful to have the original visual content as a reference for comparison (for example, to identify the type and strength of distortions). Similarly, it is generally accepted that the use of more reference information helps to reduce the complexity and improve the accuracy of a quality metric [ITU00].

However, due to their dependency on the complete reference information, FR metrics can only be used in applications where the original video content is available at the place where the quality estimation is performed, such as quality optimization at the source side or tests in a laboratory scenario. For a broader range of video systems where the quality estimation needs to be done in the middle of the network or at the end-user side, FR metrics are not feasible. This has promoted the development of No-Reference (NR) video quality metrics, where the video quality is estimated solely from the distorted video content without any reference to the original content (as shown in Figure 2.7). NR quality metrics can be used at any place within the system, so they can be applied to a wider range of applications (such as real-time quality estimation in a transmission scenario). However, the design of NR quality metrics is more difficult than that of FR metrics due to the lack of reference information. This is reflected in the number of established ITU-T standards for the different classes of quality metrics, as discussed later in this chapter.

The third class of video quality metrics is the Reduced-Reference (RR) metrics, which can be seen as a compromise between FR and NR solutions. In RR metrics, a set of important video features is normally extracted at the source side and transmitted over an ancillary communication channel to the place where the video quality is estimated. The same features are also extracted from the distorted video content, and the quality degradation caused by distortions is estimated by comparing the features from both sides (as shown in Figure 2.6). As discussed above, the more reference information is available, the more accurate the quality estimation is. But this also requires more transmission capacity on the ancillary channel. The most critical issue in the design of RR metrics is therefore the tradeoff between accuracy and the amount of overhead information.

Figure 2.5: Block diagram of a full-reference video quality assessment system (adapted from [ITU00]).

Figure 2.6: Block diagram of a reduced-reference video quality assessment system (adapted from [ITU00]).

Figure 2.7: Block diagram of a no-reference video quality assessment system (adapted from [ITU00]).

Advances of Objective Video Quality Metrics

Due to the increasing demand for reliable and accurate video quality metrics, a large amount of effort has been devoted to this topic by both industry and academia. The most remarkable work has been done by VQEG from ITU. Since 1997, VQEG has conducted a number of validation tests to evaluate the performance of various proposed VQMs. Based on the test results, VQEG has also established a series of standards which give recommendations for the choice of objective video quality metrics for different applications. A summary of the work by VQEG is given in Table 2.1.

Table 2.1: ITU Recommendations for objective video quality metrics

ITU Standard                                    Metric Type   Target Application   Validation Test
ITU-T J.144 [ITU04b], ITU-R BT.1683 [ITU04a]    FR            SDTV                 FR-TV1/FR-TV2 ( )
ITU-T J.249 [ITU10]                             RR            SDTV                 RRNR-TV ( )
ITU-T J.247 [ITU08b]                            FR            Multimedia           MM-I ( )
ITU-T J.246 [ITU08a]                            RR            Multimedia           MM-I ( )
ITU-T J.341 [ITU11a]                            FR            HDTV                 HDTV-I ( )
ITU-T J.342 [ITU11b]                            RR            HDTV                 HDTV-I ( )

Apart from the standardization efforts of VQEG, there are also other contributions in the literature. In the following, several of the most important works in the field of perceptual video quality metrics are reviewed. As mentioned previously, the focus is on metrics following the engineering approach.

Full-reference video quality metrics

The Structural SIMilarity (SSIM) index is proposed by Wang et al. in [WBSS04]. Similar to PSNR, SSIM does not make any assumption about the type of artifacts in the video. But unlike PSNR, which calculates the picture quality based on pixel-to-pixel errors, SSIM estimates the quality by measuring how well the structural information contained in the picture is preserved. Since human perception is observed to be more sensitive to distortions in structural information [WBSS04], SSIM provides much better quality predictions than PSNR. To measure the structural similarity between the original content x and the distorted content y, SSIM calculates the following three components [WBSS04]:

\[ l(x, y) = \frac{2 \bar{x} \bar{y}}{\bar{x}^2 + \bar{y}^2} \tag{2.7} \]

\[ c(x, y) = \frac{2 \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \tag{2.8} \]

\[ s(x, y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \tag{2.9} \]

where $\bar{x}$ and $\bar{y}$ are the mean pixel values of x and y, respectively, $\sigma_x$ and $\sigma_y$ denote the standard deviations, and $\sigma_{xy}$ is the covariance of x and y. The first two components, i.e. $l(x, y)$ and $c(x, y)$, can be seen roughly as measures of the similarity of brightness and contrast between x and y, respectively. The third component $s(x, y)$ is the linear correlation of the two signals, which is an indication of how well the structural information is preserved. The SSIM index is then calculated as the product of the three components:

\[ \mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y) \tag{2.10} \]

The SSIM index is at most 1, with a higher value indicating better perceptual quality. For typical image content it lies in [0, 1], although the correlation term can in principle become negative. The SSIM index was originally proposed for still image quality assessment. In [WLB04], it is adapted for video quality assessment by calculating the weighted sum of the SSIM indices of the Y, Cb and Cr components. Other extensions of the SSIM index include the MultiScale-SSIM in [WSB03b] and the Speed SSIM proposed in [WL07].

The Psytechnics full-reference video quality metric is one of the four metrics suggested by ITU-T J.247 [ITU08b] for multimedia applications with spatial resolutions from QCIF to VGA. It performed best in VQEG's Phase-I Multimedia Test [VQE08].
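Returning briefly to the SSIM index of Eqs. (2.7)-(2.10), the three components can be sketched as follows. This is a simplified illustration, not the implementation of [WBSS04]: the statistics are computed globally over a flat list of samples instead of over local windows, and the small constants `c1`, `c2` and `c3` are added here only to avoid division by zero for flat signals (the original SSIM formulation uses stabilizing constants of the same form, but their values below are arbitrary assumptions).

```python
import math


def ssim_components(x, y, c1=1e-12, c2=1e-12, c3=1e-12):
    """Return (l, c, s) for two equally long sample lists x and y.

    Global statistics are used for brevity; c1, c2, c3 are tiny
    stabilizers assumed for this sketch.
    """
    n = len(x)
    if n == 0 or len(y) != n:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    luminance = (2 * mean_x * mean_y + c1) / (mean_x ** 2 + mean_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


def ssim(x, y):
    """Product of the three components, as in Eq. (2.10)."""
    luminance, contrast, structure = ssim_components(x, y)
    return luminance * contrast * structure
```

As a quick check of the behaviour, doubling every sample (`x = [1, 2, 3]`, `y = [2, 4, 6]`) leaves the structure term at 1 but lowers both the luminance and the contrast term to 0.8, giving an index of about 0.64.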
In the Psytechnics metric, after the spatial and temporal alignment of the reference and distorted videos, seven features are extracted from the videos. The spatial distortion is measured by decomposing the frames into sub-bands using a pyramid transform (similar to the concept of the wavelet transform) and then calculating the PSNR values for selected sub-bands according to the spatial resolution of the frames. The temporal distortion is calculated based on the frequency and duration of dropped/frozen frames. These features, together with other features measuring the edge distortion, the blocking artifacts, the blurring artifacts and the spatial complexity, are combined by a linear integration function to produce a final estimate of the video quality. Different integration functions are used for different spatial resolutions (QCIF/CIF/VGA).

VQuad-HD, developed by SwissQual, is the only full-reference video quality metric suggested by ITU-T J.341 [ITU11a] for HDTV applications. A jerkiness feature is calculated based on the local and global motion intensity to indicate temporal degradation. A blockiness measure is used to capture the spatial degradation. The basic idea of the blockiness detection algorithm is that if the video is processed using a block structure of size n, then the average edge strength values calculated at a step size of n can be very different for different offsets. The third component is calculated as the distribution of the local similarity and difference features. Finally, a logistic function is used to integrate these three features for the quality estimation.

Other important full-reference video quality metrics include the Picture Quality Scale (PQS) [YMM00], the Perceptual Evaluation of Video Quality (PEVQ) by Opticom [ITU08b], and the Motion-based Video Integrity Evaluation (MOVIE) index by Seshadrinathan and Bovik [SB10].

Reduced-reference video quality metrics

In [WJP+93], Webster et al. proposed a reduced-reference quality metric based on localized TI and SI values (refer to Eqs. (2.1) and (2.2)). The TI and SI are calculated for a certain Spatial-Temporal region (S-T region) in the video sequence.
The values from the original video are transmitted to the quality estimator and compared with the values calculated from the distorted video. The outputs of the comparison are three measurements indicating the level of spatial and temporal distortions. A weighted sum of these three measurements is used as the quality estimate. The amount of overhead information can be controlled by selecting a suitable size of the S-T region.

The Yonsei reduced-reference quality metric (proposed by Yonsei University, Korea) is the only metric included in all three RR video quality standards by ITU-T (J.246 [ITU08a], J.249 [ITU10] and J.342 [ITU11b]). In this scheme, an edge map is generated by applying edge enhancement filters to the original video frames. The positions and values of a set of edge pixels are transmitted to the quality estimator. The quality estimator in turn calculates an edge map from the distorted video frames. The pixel values in the distorted edge map are then compared with the corresponding transmitted values to calculate the Edge PSNR (EPSNR). The EPSNR is then adjusted according to the strength of different artifacts (blocking/blurring/jerkiness). Finally, a piecewise linear function is applied to form the quality estimate based on the EPSNR. By adjusting the number of position/value pairs transmitted to the estimator, a compromise between prediction accuracy and the amount of side information is achieved.

Perhaps the most widely used RR metric is the General Model of the Video Quality Model (GMVQM) from the National Telecommunications and Information Administration (NTIA) [WP99, WL07]. It is included in ITU-T J.144 [ITU04b] and ITU-T J.249 [ITU10] for SDTV applications (although ITU-T J.144 is a standard for full-reference quality metrics, the techniques used in GMVQM are actually reduced-reference). It was the best-performing metric in VQEG's FR-TV tests [VQE00, VQE03] and also performed well in VQEG's RRNR-TV test [VQE09]. In GMVQM, a number of video features, such as SI/TI (as shown in Eqs. (2.1) and (2.2)), the ratio between the strength of horizontal/vertical edges and diagonal edges, the mean chrominance pixel values and the standard deviation of the luminance component, are calculated for both the reference and the distorted video. Then the strength of various artifacts, such as blocking, blurring, noise, jerkiness and color distortion, is measured according to the gain or loss of these features. These measurements are then combined by a linear function to provide an estimate of the overall video quality. Similar to the Webster metric, the rate required to transmit the features of the reference video can be controlled by adjusting the size of the S-T region for which the video features are calculated.

Another technique that can be used to implement reduced-reference video quality metrics is data hiding (such as watermarking) [FCM05, NCA06, CMB02].
Although schemes based on data hiding are often classified as no-reference quality metrics, the quality estimator does need certain shared information from the reference video (for example, in the case of watermarking, the undistorted version of the watermark needs to be available). From this point of view, it is more appropriate to consider them as reduced-reference metrics.

No-reference video quality metrics

The main focus of previous work in the field of objective video quality assessment has been on FR and RR metrics. Due to the ever-increasing demand for in-service quality monitoring, more and more research effort has been devoted to the development of NR metrics in recent years. According to the video features used, no-reference metrics can be further divided into three categories [THB08]: bitstream layer metrics, media layer metrics and hybrid metrics.

Bitstream layer metrics utilize information from the encoded video bitstream as well as information related to network performance (such as the packet-loss rate). These metrics require no or only partial decoding of the encoded bitstream, so they can be used in lightweight solutions for quality estimation. But since they do not fully exploit the content characteristics, they are often less accurate than media layer metrics. ITU-T G.1070 [ITU12] describes quality assessment metrics for videophone applications over IP networks. The recommendation contains metrics for speech, video and overall multimedia quality. The video metric in ITU-T G.1070 is based on the work of Yamagishi et al. in [YH06, HYTT07]. The metric models the quality degradation caused by coding distortion and by transmission errors separately. The coding quality of the video is modelled as the product of a power function of the bitrate and an exponential function of both the bitrate and the frame rate. The transmission quality of the video is modelled by an exponential function of the packet loss ratio, which also considers the impact of frame rate and bitrate. The overall video quality is estimated as the product of the two terms. The metric includes 12 model parameters which need to be trained for different video codecs. Suggested parameter values for different video resolutions and codecs are also given in the recommendation. The metric above assumes random packet loss; in [BM10], it is extended to account for the duration and strength of burst packet loss. In [RGSM+08], a bitstream layer metric for SD and HD IPTV applications is proposed. Similar to the metric in ITU-T G.1070, the coding distortion and the transmission distortion are modelled separately. The coding distortion is based on an exponential function of the bitrate, and the transmission distortion is modelled using the bitrate and the packet loss rate. To take the video content into account, the same authors proposed in [GSR10] a new model for the coding distortion which uses information from the encoding process such as motion vectors and quantization parameters. Other bitstream layer models include [RCNR07, KSI09, KKHD11].

Media layer metrics assess the video quality based on the decoded pixel values.
Most NR metrics in this category try to estimate the video quality by measuring the physical strength of different types of artifacts and their psychophysical impact on human perception. The main artifacts considered are blocking, blurring, ringing and motion jerkiness. For a complete review of models for different types of artifacts, the reader can refer to [HR10, Cha13]. Usually, a distorted video stream contains more than one artifact, so metrics considering the impact of multiple artifacts are more robust and practical. In [FM05], such a metric is proposed which accounts for three artifacts. The blocking artifact is measured by comparing the correlations between adjacent pixels within and across the borders of the block structure used in the codec. The blurring is estimated by examining the spread of edges in the frame. To measure the noisiness, the frames are first filtered to remove their natural structure (such as edges and textures) and keep only the noise. Then the noise variance is calculated to estimate the strength of the noise. The overall frame quality $VQ$ is modelled by using a weighted Minkowski metric of order $p$ to combine the three measurements (see Eq. (2.11)):

\[ VQ = \left( \alpha \cdot \mathrm{Blockiness}^{p} + \beta \cdot \mathrm{Blurriness}^{p} + \gamma \cdot \mathrm{Noisiness}^{p} \right)^{1/p} \tag{2.11} \]

where $p$, $\alpha$, $\beta$ and $\gamma$ are parameters which are determined by least-squares fitting.

Although humans can easily identify the type and strength of visual distortions in a video without reference to the original content, this is not an easy job for NR quality metrics.
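The pooling step of Eq. (2.11) is simple enough to sketch directly. The snippet below is only an illustration of the formula: the function name `minkowski_pool` is hypothetical, the three artifact measurements are assumed to be non-negative numbers on a common scale, and the weights and exponent used in the usage note are arbitrary rather than the fitted values of [FM05].

```python
def minkowski_pool(blockiness, blurriness, noisiness,
                   alpha, beta, gamma, p):
    """Weighted Minkowski combination of three artifact measurements.

    Implements (alpha*B^p + beta*L^p + gamma*N^p)^(1/p) as in Eq. (2.11).
    Inputs and weights are assumed to be non-negative and p positive.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    values = (blockiness, blurriness, noisiness)
    weights = (alpha, beta, gamma)
    if any(v < 0 for v in values) or any(w < 0 for w in weights):
        raise ValueError("measurements and weights must be non-negative")
    pooled = sum(w * v ** p for w, v in zip(weights, values))
    return pooled ** (1.0 / p)
```

With unit weights, `p = 1` reduces to a plain sum of the three measurements, whereas larger values of `p` let the strongest artifact dominate the result; for instance `minkowski_pool(3, 4, 0, 1, 1, 1, 2)` evaluates to 5.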

To address the difficulty of identifying distortions without a reference, an extensive framework for blind image quality assessment based on Natural Scene Statistics (NSS) has been proposed in [MB11]. The basic assumption is that natural scenes have certain statistical properties which tend to be destroyed by distortions, so the abnormality of the picture statistics is a good indication of the type and strength of different distortions. The proposed algorithm first extracts 88 statistical features from the content; two vectors are then calculated from these features. The first vector gives the probabilities of the content suffering from different types of artifacts, and the second vector estimates the resulting picture quality when the content is affected by a certain artifact. The overall quality is computed as the inner product of the two vectors.

Hybrid metrics aim to combine the merits of media layer metrics and bitstream layer metrics by using all the available information. On the one hand, the decoded pixel information can help to improve the accuracy of the estimation. On the other hand, information from the network and the bitstream can be used to extract video features more efficiently and thus avoid unnecessary computation. In [KHD12], Keimel et al. propose an NR hybrid video quality metric for HDTV content coded with H.264/AVC. The metric utilizes features extracted from both the bitstream (such as slice type, average quantization parameter for each slice, motion information and percentage of different MB types) and the pixel domain (such as blocking/blurring measurements, motion continuity and edge continuity). The weighted sum of these features is then used in a sigmoid function to estimate the overall quality. In comparison to a previous metric using only bitstream layer information [KKHD11], the hybrid metric provides better prediction accuracy. According to the evaluation, the hybrid metric also outperforms FR metrics such as PSNR, SSIM and GMVQM.
VFactor is a patent-protected hybrid video quality metric which is used in many commercial quality monitoring applications [Che]. According to the introduction in [WM08], VFactor uses not only information from the decoded pixel domain and the video coding layer of the bitstream, but also information from the Packetized Elementary Stream (PES) layer (such as timing information) and the Transport Stream (TS) layer (packet loss, delay, delay jitter, etc.). The development of hybrid video quality metrics is also a main focus of VQEG. One of the initial work items of the Joint Effort Group (JEG) newly formed by VQEG is to develop a no-reference hybrid video quality metric for H.264/AVC.

2.3 Summary

In this chapter, both objective and subjective methods for video quality assessment are introduced. For subjective video quality assessment, the guidelines provided in the ITU standards are summarized. Also, the advantages and disadvantages of different test methodologies are analyzed. This is followed by a review of previous work on objective video quality assessment. Different approaches for designing objective video quality metrics are discussed and the metrics are classified according to the information they use. The review shows that most of the achievements so far are in the field of full-reference and reduced-reference quality metrics. Most no-reference metrics proposed so far are distortion- and application-specific due to the lack of reference information. It is still very difficult to build generic no-reference metrics without a deeper understanding of the HVS. In Chapter 3, the guidelines presented in this chapter are followed to conduct extensive subjective quality assessments, and the results are used to develop a no-reference objective video quality metric for QoE-driven multi-dimensional video adaptation.

Chapter 3 Perceptual Video Quality Modeling

In this chapter, the impact of frame size, frame rate and quantization on the perceived quality of a video is explored and a Multi-Dimensional Video Quality Metric (MDVQM) is proposed to estimate the video quality in the presence of quantization, frame dropping and spatial down-sampling. The SNR video quality is captured by a logistic function, whereas the impacts of frame rate reduction and spatial down-sampling are modelled separately as temporal and spatial correction factors. The overall video quality metric is then calculated as the product of these components. The proposed metric uses only a few features that can be easily extracted from the bitstream or the decoded frames and is thus practical for real-time video adaptation applications.

3.1 Introduction

The remarkable evolution of communication networks has enabled video content delivery over mobile networks. The increased power of end devices and the users' ever-increasing demand for video content have further boosted the popularity of video applications. The video quality perceived by the end users is the most crucial factor for the success of video services. Therefore, in-service monitoring and optimization of the video quality is becoming more and more important for service providers. As discussed in Chapter 1, subjective quality assessment is not feasible in this scenario due to the involvement of human observers. In-service monitoring can only be achieved by employing objective video quality metrics which estimate the perceived video quality automatically and accurately. Many quality metrics have been proposed in the literature and some have already been used in commercial solutions. However, the heterogeneity of the end users brings new challenges for quality estimation. Most prior video quality metrics deal with a fixed spatial and temporal resolution. Meanwhile, as mentioned in Chapter 1, transmitted videos often need to be adapted to a different display size and frame rate. Hence, it is important to develop new video quality metrics which consider the impact of different adaptation schemes on the perceived video quality.

Typically, video adaptation can be performed by changing the Quantization Parameter (QP), the frame rate/Temporal Resolution (TR), or the frame size/Spatial Resolution (SR). Using a larger QP results in stronger coding artifacts (e.g. blocking artifacts for block-based hybrid video coders). Reducing the TR by dropping frames affects the smoothness of motion, and reducing the SR by spatial down-sampling introduces blurring artifacts if the video is later up-sampled and displayed at the original resolution. In the following, unless otherwise stated, the term SNR Video Quality (SNRVQ) is used to denote the video quality resulting from quantization only. The terms Spatial Video Quality (SVQ) and Temporal Video Quality (TVQ) refer to the perceived video quality when a reduction of SR and TR is performed, respectively. The term Spatial-Temporal Video Quality (STVQ) is used to denote the video quality when both TR and SR are reduced.

The remainder of this chapter is structured as follows. Section 3.2 reviews the related work on video quality assessment involving quantization, frame rate and frame size, separately and jointly. Section 3.3 describes the conducted subjective tests. In Section 3.4, the results of the subjective tests are analyzed and a novel no-reference video quality metric is introduced. The performance of the proposed quality metric is evaluated and compared with related metrics from the literature. In Section 3.5, the work presented in this chapter is summarized.

3.2 Related Work

Several subjective studies have been performed and reported in the literature to analyze the impact of frame rate and spatial resolution on the subjective quality.
In [WCL04], the authors study the preferred frame rate by performing subjective tests using CIF (352x288) sequences encoded at different bit-rates ( kbps) and frame rates (30fps/15fps/7.5fps). The results show a general trend that the preferred frame rate decreases when the encoding bit-rate decreases. The sequences are further divided into three categories according to their content complexity. The analysis shows that the bit-rates at which the optimal frame rate switches vary significantly between the categories, which indicates that the user preference is content dependent: the higher the content complexity, the higher the switching bit-rates. In [CT07], the results from a number of previous studies are summarized to study the effects of different frame rates on human perception for various scenarios. The finding is that although the results vary slightly according to the task, the viewing conditions and the viewers' characteristics, the minimum frame rate should be kept at 10-15fps to achieve an acceptable performance.

The study in [WSV+03] investigates the impact of QP, spatial resolution and frame rate on the subjective quality of H.263 encoded videos. Two subjective tests are conducted using five source sequences with an original resolution of 320x192 at 30 fps. The first test studies how the subjective quality is affected when QP and spatial resolution are adjusted jointly, while the second focuses on jointly adjusting QP and frame rate. The overall conclusion is that human vision is more sensitive to quantization artifacts than to blur and motion jerkiness, especially at medium and low bit-rates. The authors suggest that when the QP used for encoding reaches a certain threshold, the frame rate and/or spatial resolution should be changed in order to achieve a better subjective quality, and that this QP threshold depends strongly on the spatial/temporal activity of the content. A similar study of the joint impact of the same parameters (QP, SR and TR) for low bit-rate cases is conducted in [ZCL+08]. Both H.263 and H.264/AVC codecs are used to encode the test sequences at a constant bit-rate (in contrast to [WSV+03], where constant QP values are used). The test results confirm the conclusions in [WSV+03]. In [LSR+10], an extensive study is performed for HD resolution videos encoded with two scalable video codecs: H.264/SVC and a wavelet-based scalable video codec (W-SVC). The sequences are encoded for a wide range of bit-rates (from 300kbps to 4Mbps) using 3 spatial layers (HD/640x360/320x180) and 4 temporal layers (from 50fps to 6.25fps). The conclusion is that when the bit-rate is low, it is preferable to reduce the spatial resolution from HD to 640x360 to prevent strong blocking artifacts. But further spatial down-sampling (to 320x180) should be avoided due to the strong blurring artifacts caused by up-sampling back to HD.
For relatively high bit-rate cases, in contrast, a certain level of spatial quality is already guaranteed, so a higher frame rate is more desirable than a higher spatial resolution. It is also found that although the choice of codec type does influence the test results, the overall tendency is consistent across the two codecs. The above works do not propose any concrete video quality metrics for different spatial and temporal resolutions. In [LLS+05], the authors propose a metric based on an expo-logarithm function of the frame rate to estimate the negative impact of frame dropping on the perceived video quality. The average of every frame's maximal motion vector magnitude is used in the metric as a representation of the motion intensity, in order to take the impact of the video content into account. Another work in [QG08] considers the jitter and jerkiness effects. A subjective study is conducted in which the video quality is deteriorated by frame dropping with varying strength, burst length and frequency. An interesting finding is that jitter is more annoying than jerkiness; therefore, the frame rate should not be changed too frequently. Unfortunately, only the jerkiness effect is modelled, using a sigmoid function of the frame rate. A problem of the above metrics is that only the temporal quality of the video is considered, which has limited their

CHAPTER 3. PERCEPTUAL VIDEO QUALITY MODELING

application in practice. A video quality metric QM is proposed by Feghali et al. in [FWSV07]. The metric considers both the SNR and the temporal quality of the video. The video quality is estimated simply by the average PSNR value when no frame rate reduction is conducted (FR=30fps). In case of frame dropping, the PSNR value is significantly affected due to the difference between the repeated frames and the original frames. To address this issue, the authors propose to add a compensation term to the PSNR value which depends on the frame rate and motion intensity for a more accurate estimation of the overall quality. The motion intensity is estimated by the average magnitude of the top 25% of the largest motion vectors in each frame. In [KJSR08, SYN+10], the above metric is extended by considering also the impact of spatial characteristics of the video. The SNR quality is still estimated by PSNR and temporal quality is modelled similarly except that the motion activity measure is calculated as the standard deviation of the motion vector magnitudes. Spatial quality is modelled as a sigmoid function of the height of the frame in [KJSR08] and in [SYN+10] it is modelled using an exponential function of the height and a spatial activity measure. The overall quality is computed as the weighted sum of the three quality values. However, the accuracy of motion vectors is strongly affected by the chosen motion estimation algorithm and sometimes also the bit-rate (which affects the quality of the reference frames), therefore the above metrics sometimes suffer large estimation errors. In [OMW09, OMLW11], Ou et al. propose the metric VQMTQ which models the impact of frame dropping and quantization on the perceived video quality.
The overall video quality is estimated as:

\[ \mathrm{SQF} = \hat{Q}_{\max} \left( 1 - \frac{1}{1 + e^{p(\mathrm{SPSNR} - s)}} \right) \tag{3.1} \]

\[ \mathrm{TCF} = \frac{1 - e^{-\alpha_t f / f_{\max}}}{1 - e^{-\alpha_t}} \tag{3.2} \]

\[ \mathrm{VQMTQ} = \mathrm{SQF} \cdot \mathrm{TCF} \tag{3.3} \]

where $\hat{Q}_{\max}$ is the subjective rating for the highest quality video (which is empirically set to 90 for a MOS scale). $f$ and $f_{\max}$ are the frame rate after and before frame dropping, respectively. $\alpha_t$, $p$ and $s$ are parameters depending on the video content. SQF estimates the SNR quality of the video and TCF is a correction factor modeling the negative impact of frame dropping on the video quality. In [XOMW10, OXMW11], VQMTQ is extended to QSTAR, where the impact of frame size is also considered by introducing a spatial correction factor:

\[ \mathrm{SCF} = \frac{1 - e^{-\alpha_s (s / s_{\max})^{\beta_s}}}{1 - e^{-\alpha_s}} \tag{3.4} \]

\[ \mathrm{QSTAR} = \mathrm{VQMTQ} \cdot \mathrm{SCF} \tag{3.5} \]

where $\alpha_s$ and $\beta_s$ are content-dependent parameters. In [PS11], Peng et al. propose a full-reference video quality metric STVQM for the estimation of SNR and temporal video quality:

\[ \mathrm{SVQM} = \frac{100}{1 + e^{-(\mathrm{SPSNR} + w_s \cdot SA + w_t \cdot TA - \mu)/s}} \tag{3.6} \]

\[ \mathrm{TVQM} = \frac{1 + a \cdot TA^{b}}{1 + a \cdot TA^{b} \cdot \frac{30}{FR}} \tag{3.7} \]

\[ \mathrm{STVQM} = \mathrm{SVQM} \cdot \mathrm{TVQM} \tag{3.8} \]

where SVQM and TVQM model the SNR video quality and the quality degradation caused by frame dropping, respectively. SPSNR is the spatial PSNR (which is computed by averaging the PSNR values over the non-dropped frames). $w_s$, $w_t$, $\mu$, $s$, $a$ and $b$ are parameters that need to be trained from the quality ratings collected from the subjective tests. SA and TA are measures of spatial and temporal activity of the video content, respectively. They are calculated by the following equations:

\[ SA = \mathrm{mean}_{time} \{ \mathrm{std}_{space} [ \mathrm{Sobel}(F_n) ] \} \tag{3.9} \]

\[ TA = \mathrm{mean}_{time} \{ \mathrm{std}_{space} [ F_n - F_{n-1} ] \} \tag{3.10} \]

It can be seen that the calculations of TA and SA are very similar to those of TI and SI in Eqs.(2.2)(2.1), except that the average value over time is calculated instead of the maximum value. According to the evaluations in [PS11], VQMTQ and STVQM provide significantly better estimation performance than QM, while the performance difference between VQMTQ and STVQM is not statistically significant. More concretely, for both VQMTQ and STVQM, the Pearson Correlation (PC) with the ratings from subjective tests is higher than 0.95 and the Root-Mean-Square Error (RMSE) is less than 10 on a MOS scale. Summarizing the above results, almost all the current quality metrics which deal with the problem of multi-dimensional optimization of perceived video quality are FR metrics designed on the basis of PSNR. Although they can provide accurate prediction of video quality, it is not feasible to use them for real-time video adaptation inside the network, because the original video, which is required for the PSNR, SA and TA calculation, is not available there.
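As a rough illustration of how the correction factors of VQMTQ and QSTAR behave, the following Python sketch evaluates Eqs.(3.1)-(3.5), reading SQF as a logistic function of SPSNR and TCF/SCF as normalized inverted exponentials of the frame-rate and frame-size ratios. The function names and every parameter value used with them are hypothetical and serve only as an illustration; they are not the trained values reported in [OMLW11] or [OXMW11].

```python
import math


def sqf(spsnr, p, s, q_max=90.0):
    """SNR quality factor, Eq.(3.1): logistic in SPSNR, scaled by q_max."""
    return q_max * (1.0 - 1.0 / (1.0 + math.exp(p * (spsnr - s))))


def tcf(f, f_max, alpha_t):
    """Temporal correction factor, Eq.(3.2); equals 1 when f == f_max."""
    return (1.0 - math.exp(-alpha_t * f / f_max)) / (1.0 - math.exp(-alpha_t))


def scf(size, size_max, alpha_s, beta_s):
    """Spatial correction factor, Eq.(3.4); equals 1 when size == size_max."""
    ratio = (size / size_max) ** beta_s
    return (1.0 - math.exp(-alpha_s * ratio)) / (1.0 - math.exp(-alpha_s))


def qstar(spsnr, f, f_max, size, size_max, p, s, alpha_t, alpha_s, beta_s):
    """QSTAR, Eq.(3.5): VQMTQ (SQF * TCF) multiplied by SCF."""
    vqmtq = sqf(spsnr, p, s) * tcf(f, f_max, alpha_t)
    return vqmtq * scf(size, size_max, alpha_s, beta_s)


if __name__ == "__main__":
    # Hypothetical content parameters, for illustration only.
    full = qstar(36.0, 30, 30, 704 * 576, 704 * 576, 0.3, 30.0, 3.0, 4.0, 0.7)
    half_rate = qstar(36.0, 15, 30, 704 * 576, 704 * 576, 0.3, 30.0, 3.0, 4.0, 0.7)
    print(full, half_rate)
```

Both correction factors are normalized so that they equal 1 at the full frame rate and full frame size, which is why the product form leaves the SNR quality unchanged when no temporal or spatial reduction is applied.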
In this chapter, a no-reference video quality model named MDVQM is proposed to address the demand for multi-dimensional video adaptation. The impacts of quantization, frame rate and frame size

are modelled separately and the overall video quality is determined as the product of these different factors. The metric uses only two activity measures from the video content and thus is computationally efficient. Validation tests show that the quality predictions of the metric correlate very well with the ratings obtained in subjective tests.

3.3 Details of the Subjective Study

In order to understand how different factors (i.e. quantization, frame rate and frame size) affect the perceptual video quality, two separate subjective tests are conducted. The first test (Test I) focuses on the impact of individual impairments such as those caused by frame dropping or spatial down-sampling. The second test (Test II) aims to evaluate the impact on video quality when TR and SR are changed at the same time. Since the current 3G mobile networks employ powerful error correction techniques at the physical and link layer, it is assumed in this work that channel impairments such as bit errors and packet loss are hidden from the application layer. From the perspective of the video applications, changing channel conditions are therefore only reflected by varying transmission rates, which define the target rate for the video adaptation. Consequently, network errors are not explicitly considered during the design of the subjective tests and the development of objective quality metrics. Furthermore, this work focuses on the non-scalable version of H.264/AVC video, because it covers the lion's share of the video traffic in today's internet. All the test materials are encoded using H.264/AVC video codecs and the proposed metric is trained based on the corresponding subjective data.
Although the choice of video codec might affect the results, the analysis and evaluation in this work are general and can easily be extended to other codec types.

Source Sequences

In Test I, eight source video sequences (SRC) with a wide range of spatial and temporal content characteristics are used. Six of them are well-known standard test sequences available from [Xip]: CREW (CR), HARBOUR (HA), SOCCER (SC), PEDESTRIAN AREA (PA), PARK JOY (PJ), FOOTBALL (FB). Two of them are internet videos from Youtube: OBAMA (OB)[Youb] and KOBE (KO)[Youa]. In Test II, three standard test sequences (PA, FB, and Rush Hour (RH)) are used. A clip of 10 seconds is selected from each SRC in order to maintain a high concentration of the subjects. All the standard test sequences are in 4CIF (704x576) resolution. The original spatial resolution of the two Youtube sequences is 1024x768 and the sequences are center-cropped to 4CIF resolution. The frame rates of the SRCs are either 60fps or 30fps. Figure 3.1 shows example frames of the SRCs and their original frame rates are given in the titles.

Figure 3.2: Spatial Information vs. Temporal Information indices of the source videos (SI plotted against TI for ParkJoy, Harbour, Kobe, Football, Obama, RushHour, Crew, PedestrianArea and Soccer)

Table 3.2: Bit-rates, frame rates and spatial resolutions of the processed video sequences for Test II

SRC    BR (kbps)    FR(fps)xSR
PA                  x4CIF, 15xCIF
FB                  x4CIF, 15xCIF
RH                  x4CIF, 15xCIF

In Test I, 12 processed video sequences (PVS) are generated for each SRC (96 in total for the 8 SRCs). The PVSs are encoded using the open source X264 encoder in an IPPP...P structure with Constant Bit-Rate (CBR). The original rate control algorithm in X264 is replaced by a new algorithm based on ρ-domain analysis, which will be discussed in more detail in Chapter 4. The PVSs for each SRC are divided into 3 groups (4 for each group). For the first group, the original SR and TR are kept unchanged and it is referred to as the SNR group. For the second group, the PVSs are spatially down-sampled to CIF resolution, referred to as the SR group. For the last group (referred to as the TR group), the PVSs are temporally down-sampled by a factor of 2-4. For each group, the PVSs are encoded at 4 different bit-rates. A description of the encoding bit-rates, frame rates and spatial resolutions of the PVSs is given in Table 3.1. One thing to note here is that a display window of fixed size (4CIF) is used in the test, so all the spatially down-sampled sequences are resampled back to their original resolution for playback. Details of how the subjective data is split for model

3.3. DETAILS OF THE SUBJECTIVE STUDY

training and subsequent validations are given in Section . In Test II, 8 PVSs are generated for each SRC, among which 4 are encoded in full resolution and 4 are down-sampled both spatially and temporally before encoding. Within each group, the PVSs are encoded with 4 different bit-rates in a CBR manner. Detailed information on the PVSs in Test II is given in Table 3.2. To avoid fatigue of the test subjects, Test I is divided into 3 subtests. The first subtest includes all the PVSs from CR, HA and SC. The second subtest includes all the PVSs from HA, PA and PJ. The third subtest includes all the PVSs from HA, FB, OB and KO. The PVSs from HA are included in all three subtests so that this common set can later be used to combine the scores from different subtests into a super dataset, as will be discussed in Section . Similarly, in Test II, a set of common sequences from FB and PA are included for calibration purposes.

Test Methodology

As mentioned in Section 2.1, the SAMVIQ method [ITU07] is adopted in this work to collect subjective ratings for the test videos. A graphical software interface is developed which implements SAMVIQ for the subjective test. The central part of the interface is shown in Figure 3.3. The video is displayed at the original resolution at the center of the screen and the background is set to a mid-level grey color. The test sequences are accessed through the access buttons (the REF button corresponds to the reference sequence and buttons A - M correspond to the processed sequences and the hidden reference). After viewing each sequence, the test subject can use the slider bar on the right hand side to score the sequence. The score is displayed under the corresponding access button. The slider uses a continuous quality scale from 0 to 100, which is divided into five equal intervals annotated with five adjectival quality terms (Excellent, Good, Fair, Poor, Bad) for general guidance according to [ITU07].
If a test subject is viewing a sequence for the first time, the whole sequence should be watched and no jump to other sequences is allowed during the play (the access buttons of all other sequences are disabled during the first play). If the test subject is viewing a sequence to which a score has already been given, the playout process can be stopped and resumed (through the STOP and PLAY button, respectively). In this case, the test subject can also switch to other sequences at any time (through the access buttons). Once all the sequences in a test scene have been scored, the NEXT button can be used to proceed to the next scene (a test scene contains all the test sequences from the same source sequence). After the test subject has finished all the test scenes, the subjective test can be ended using the END button.

Figure 3.3: The graphical user interface implementing the SAMVIQ method

Test Subjects

A total of 56 test subjects have participated in the tests. The number of test subjects in each test is summarized in Table 3.3 (the number in the bracket is the number of subjects that are rejected by the screening process as discussed in Section 3.3.6). Note that the participants in the tests are overlapping. All the participants are non-experts, which means that they were not professionally involved in image/video quality assessment at their work. The subjects all have normal or corrected-to-normal visual acuity and are between 21 and 38 years old, including both males and females.

Test Environment and Procedure

The general viewing conditions in the subjective tests were arranged as specified by ITU-R Rec. BT.1788 [ITU07] for a laboratory environment. The room for the experiments was equipped with 17-inch LCD monitors of type FUJITSU SIEMENS SCENICVIEW B17-2 CI. The ratio of inactive screen luminance to peak luminance was kept below a value of . The viewing distance is about 4 times the height of the video stimulus. A test session is divided into three phases: the instruction phase, the training session and the formal test session. During the instruction phase, a written instruction was distributed to the participants, explaining the tasks to be performed in the tests. The training session was conducted prior to the test session to get the participants familiar with the test mechanism

and to demonstrate the range of artifacts to be expected during the actual test session. The scores obtained during the training session were not considered in the final results. Questions from the subjects were allowed during the training session. The test session began after the training session and its average duration was about 20 minutes. No questions were allowed during the test session.

Table 3.3: Number of subjects in the tests. The numbers in the bracket indicate the number of test subjects rejected by the screening process in each subtest as discussed in Section .

          Test I                            Test II
          Sub. I    Sub. II    Sub. III
#Subj.    18 (2)    18 (1)     24 (3)       24 (2)

Subjective Data Post-Processing

The screening process defined in [ITU07] is adopted to reject test subjects who may have rated randomly or inconsistently. More specifically, the Pearson correlation coefficient $r_p$ and the Spearman's rank correlation coefficient $r_s$ between the ratings of each viewer and the mean ratings of all viewers are calculated using Eq.(3.11) and Eq.(3.12), respectively:

\[ r_p = \frac{\sum_{i=1}^{N_v} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N_v} (x_i - \bar{x})^2 \sum_{i=1}^{N_v} (y_i - \bar{y})^2}} \tag{3.11} \]

\[ r_s = 1 - \frac{6 \sum_{i=1}^{N_v} [R(x_i) - R(y_i)]^2}{N_v^3 - N_v} \tag{3.12} \]

where $x_i$ is the individual score of the viewer for video $i$ and $y_i$ is the mean score of all the viewers for video $i$. $N_v$ is the number of test sequences. $\bar{x}$ and $\bar{y}$ are the mean values of $\{x_i \mid i = 1 \ldots N_v\}$ and $\{y_i \mid i = 1 \ldots N_v\}$, respectively. $R(x_i)$ is the ranking order of the score $x_i$ in $\{x_i \mid i = 1 \ldots N_v\}$. Then, the correlation $r_j$ of the individual scores from viewer $j$ against the corresponding mean scores from all the viewers is calculated by:

\[ r_j = \min(r_{p_j}, r_{s_j}) \tag{3.13} \]

The rejection threshold is determined by:

\[ th_{reject} = \begin{cases} 0.85, & \text{if } [\mathrm{mean}(r) - \mathrm{std}(r)] > 0.85 \\ \mathrm{mean}(r) - \mathrm{std}(r), & \text{otherwise} \end{cases} \tag{3.14} \]

where $r = [r_1, r_2, \ldots, r_j, \ldots, r_{N_s}]$ is the vector of correlation values of all the viewers. Finally, the following rejection criterion is applied:

\[ \text{observer } j \text{ is } \begin{cases} \text{rejected,} & \text{if } r_j < th_{reject} \\ \text{not rejected,} & \text{otherwise} \end{cases} \tag{3.15} \]

The number of rejected subjects in each test is given in Table 3.3. After screening, the Mean Opinion Score (MOS) is calculated from the subjective ratings. Let $x_i^j$ denote the rating of test sequence $i$ given by subject $j$, and $x_{ref}^j$ be the rating of the corresponding hidden reference given by the same subject. The Differential Mean Opinion Score (DMOS) value of test sequence $i$ (denoted as $\mu_i$) is calculated as:

\[ \mu_i = \frac{1}{N_s} \sum_{j=1}^{N_s} (x_i^j - x_{ref}^j + 100) \tag{3.16} \]

where $N_s$ is the number of test subjects. The DMOS value is used as the subjective quality measure for the PVSs. Note that since the raw subjective rating is in the range [0,100], it is possible that DMOS values are greater than 100, and these values are considered valid and included in the analysis. With the subjective ratings from the common set as mentioned in Section 3.3.2, the method proposed in [PW08] is used to generate a super dataset for the development of our video quality metric. Briefly speaking, an overall average (over all subtests) of the DMOS values is first calculated for each of the videos in the common set. These overall average values are considered as the most accurate measurements of the video quality. By fitting the average values of the common videos from each subtest to the overall average values, a linear mapping function is determined. This linear mapping function is used to convert the DMOS values from the subtests to form a super dataset for our later analysis. The Confidence Interval (CI) associated with the DMOS value of test sequence $i$ is given by:

\[ [\mu_i - \delta_i, \ \mu_i + \delta_i] \tag{3.17} \]

The term $\delta_i$ in Eq.(3.17) can be derived from the standard deviation $\sigma_i$ and the number of test subjects $N_s$.
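The subject screening of Eqs.(3.11)-(3.15) and the DMOS of Eq.(3.16) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names are hypothetical, tied scores share their average rank, and the sample standard deviation is used for std(r), since the text does not specify either choice.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient, Eq.(3.11)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def ranks(values):
    """Rank of each value (1 = smallest); ties get the average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation coefficient, Eq.(3.12)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)


def rejected_viewers(ratings):
    """ratings[j][i] is the score of viewer j for video i; returns rejected indices."""
    n_videos = len(ratings[0])
    mean_scores = [mean(v[i] for v in ratings) for i in range(n_videos)]
    # Eq.(3.13): per-viewer correlation is the smaller of the two coefficients.
    r = [min(pearson(v, mean_scores), spearman(v, mean_scores)) for v in ratings]
    # Eq.(3.14): the threshold is capped at 0.85.
    threshold = min(0.85, mean(r) - stdev(r))
    # Eq.(3.15): reject viewers below the threshold.
    return [j for j, rj in enumerate(r) if rj < threshold]


def dmos(scores, ref_scores):
    """Eq.(3.16): scores[j] and ref_scores[j] are given by the same subject j."""
    return mean(x - ref + 100.0 for x, ref in zip(scores, ref_scores))
```

Writing the threshold as `min(0.85, mean - std)` is equivalent to the two-case form of Eq.(3.14) and makes it clear that a panel with very consistent viewers is never screened more strictly than 0.85.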
For example, a 95% CI is calculated as:

\[ \delta_i = 1.96 \cdot \frac{\sigma_i}{\sqrt{N_s}} \tag{3.18} \]

where the standard deviation $\sigma_i$ for test sequence $i$ is defined as:

\[ \sigma_i^2 = \frac{1}{N_s - 1} \sum_{j=1}^{N_s} (x_i^j - \mu_i)^2 \tag{3.19} \]

The derived DMOS values of all test videos, along with the corresponding 95% confidence intervals, are shown in Figures 3.4 and 3.5. In the figures, the blue curves always correspond to

the videos with full resolution. In Figure 3.4, the red dotted curves correspond to the videos with reduced SR and the green curves correspond to the videos with reduced TR. In Figure 3.5, the red dotted curves correspond to the videos whose TR and SR are reduced at the same time. From the results, it can be seen that at high bit-rates, the blue curves are always above the other curves, which indicates that full spatial/temporal resolution is preferred. As the bit-rate decreases, some blue curves intersect with the red or green curves, indicating that at a lower bit-rate, reducing the spatial/temporal resolution is the better choice for transcoding. The different shapes of the curves also suggest that the characteristics of the video content have a strong impact on the perceived video quality.

3.4 Design of the NR Video Quality Metric

As mentioned in the previous section, in order to understand the impact of changing different parameters, subjective tests are conducted to collect subjective quality ratings. These data serve as the ground-truth quality ratings for the design, development and validation of objective quality metrics. As mentioned above, the three considered parameters that affect the perceived video quality are QP, TR and SR. It has been shown in [OXMW11, PS11] that the impact of quantization (QP) is separable from that of TR and SR, so they are studied and modelled separately in the following.

SNR Quality Metric

Design of the SNR Quality Metric

Many objective quality metrics for measuring the SNR quality of video sequences have been proposed in the literature. In our application scenario, video adaptation is usually performed at an intermediate network node (e.g. a proxy server) at the edge of the core and access network, where the reference video is not available. Therefore, a no-reference video quality metric is best suited for this situation.
To model the SNR video quality, the first step is to select an appropriate functional form. Many PSNR-based full-reference video quality metrics, such as the PSNR-VQM in [PW02], the PEVQ in [ITU08b] and VQuad-HD in [ITU11a], use the sigmoid function as the basic functional form. The popularity of the sigmoid function is due to the finding from the subjective results [VQE00, VQE03] that PSNR usually only correlates linearly with the MOS values in the middle of the quality range, while saturation of MOS values appears towards the two extremes of the quality range. This phenomenon accords with the fact that human observers tend to have difficulty identifying quality differences between two videos

3.4. DESIGN OF THE NR VIDEO QUALITY METRIC

with extremely good or bad quality. The typical form of a sigmoid function can be written as:

\[ P(t) = \frac{1}{1 + e^{-c(t - d)}} \tag{3.20} \]

where $c$ and $d$ are parameters which can be used to adjust the shape of the sigmoid function. Figure 3.6 shows several sigmoid functions with different parameters.

Figure 3.6: sigmoid functions with different parameters

From the figure, it can be seen that the parameter $c$ controls the dropping rate of the middle range of the curve while the parameter $d$ can be used to control the position of the saturation point. In practice, $c$ and $d$ can be modelled as functions of the spatial-temporal characteristics of the video content. The full-reference metric STVQM in Eq.(3.6) also uses the sigmoid function for the estimation of SNR video quality and has been shown to provide good quality prediction. In the following, it is used as a starting point to derive a no-reference video quality metric. To change the metric into a no-reference quality model, the features SPSNR, SA and TA in Eqs.(3.9)(3.10) need to be estimated from the decoded video frames instead of from the reference video. To observe how the TA and SA values differ between the original video and the encoded video subjected to quantization artifacts, experiments are conducted in which several typical test video sequences at CIF resolution (including Foreman, Mother&Daughter, etc.) are encoded with different QPs and the TA and SA values are extracted from the encoded sequences. The obtained TA and SA values are shown in Figure 3.7. It can be seen that although the TA and SA values do change as a function of the QP values, the extent of change is quite limited. In our experiments, the change of TA and SA from high QP (low bit-rate) to low QP (high bit-rate) is no larger than 8% for most test sequences. This observation indicates that TA and SA extracted from decoded sequences can be seen as a good approximation to those extracted from the original sequences.
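To make the extraction of the two activity measures from decoded frames more concrete, the following sketch computes SA and TA as defined in Eqs.(3.9) and (3.10) on frames given as plain 2-D lists of luma values. It is an illustration under several assumptions rather than the implementation used in this work: the function names are hypothetical, the Sobel gradient magnitude is evaluated on interior pixels only, and the population standard deviation is used over space.

```python
import math
from statistics import mean, pstdev


def sobel_magnitude(frame):
    """Sobel gradient magnitude for the interior pixels of a 2-D luma frame."""
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            out.append(math.hypot(gx, gy))
    return out


def spatial_activity(frames):
    """Eq.(3.9): time average of the spatial std of the Sobel-filtered frames."""
    return mean(pstdev(sobel_magnitude(f)) for f in frames)


def temporal_activity(frames):
    """Eq.(3.10): time average of the spatial std of successive frame differences."""
    stds = []
    for prev, cur in zip(frames, frames[1:]):
        diff = [c - p for row_p, row_c in zip(prev, cur) for p, c in zip(row_p, row_c)]
        stds.append(pstdev(diff))
    return mean(stds)
```

Because both functions only need pixel values, they can be applied to the decoded sequence at an adaptation node exactly as they would be applied to the original sequence.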

Figure 3.7: Spatial Activity (SA) and Temporal Activity (TA) variation against bit-rate for typical test sequences (soccer, cheer leader, flower garten, mobile, coastguard, foreman, carphone, silence, M&D, hall monitor, container; axes: SI vs. TI)

It is known from rate-distortion theory that the relationship between the bit-rate and PSNR can be approximately modelled using a logarithmic function. Since our test videos have different frame rates and frame sizes, the pixel bit-rate (bit-per-pixel) is used here instead of the normal bit-rate in bit-per-second:

\[ bpp = \frac{BR}{FR \cdot FS} \tag{3.21} \]

where FR and FS are the frame rate and frame size, respectively. Then the SPSNR in Eq.(3.6) can be estimated by:

\[ \mathrm{SPSNR} = m \cdot \ln(bpp) + n \tag{3.22} \]

where $bpp$ is the pixel bit-rate and $\ln(x)$ is the natural logarithm of $x$. $m$ and $n$ are content-dependent parameters. For simplicity, the parameter $n$ is modelled as a linear combination of SA and TA, so that it can be merged with the other items in Eq.(3.6). For the parameter $m$, different types of functions are examined and the power function seems to provide the best performance. Finally, the SNR quality of a video is modelled as:

\[ m = TA^{a_0} \, SA^{a_1} \, a_2 \tag{3.23} \]

\[ \mathrm{SNRVQ} = \frac{100}{1 + e^{-(m \cdot \ln(bpp) + a_3 \cdot SA + a_4 \cdot TA + a_5)}} \tag{3.24} \]

where $a_0, \ldots, a_5$ are model parameters which need to be trained using subjective data. Compared to the FR quality metric in STVQM, our NR quality metric has two more parameters. However, all the features can be extracted from the decoded frames, which makes this metric applicable for video adaptation in the absence of the original content.

Performance Analysis of the SNR Quality Metric

In this section, the performance of the proposed SNR quality model is evaluated against several state-of-the-art video quality metrics. According to the criteria used by VQEG in its multimedia test [VQE08], the performance of a video quality metric can be measured by the accuracy and consistency of the predictions. In [VQE08], accuracy is defined as the ability to predict the subjective quality ratings with low error, while consistency is defined as the degree to which the model maintains prediction accuracy over the range of video test sequences. The Pearson Correlation (PC) and the Root Mean Square Error (RMSE) are used to measure the accuracy of a metric and the consistency is measured by the Outlier Ratio (OR). The formula to calculate the PC value has already been given in Eq.(3.11), but $x_i$ in the formula now denotes the quality prediction from the metric for video sequence $i$. The PC values are within the range [-1,1], with 1 indicating the strongest positive linear relationship between the model predictions and the subjective quality ratings. The RMSE value is defined as:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N_v - d} \sum_{k=1}^{N_v} [\mathrm{DMOS}(k) - PQ(k)]^2} \tag{3.25} \]

where $N_v$ denotes the number of videos considered in the analysis, and $d$ denotes the number of metric parameters which need to be trained from the subjective data. DMOS is the quality rating obtained from the subjective test and $PQ$ is the predicted quality from the metrics.
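Returning briefly to the proposed SNR quality model, the chain from bit-rate to predicted quality in Eqs.(3.21)-(3.24) can be sketched as follows. The sketch assumes that the bit-rate is given in bits per second, that $m$ is the product of $a_2$ with the two power terms of TA and SA, and that the exponent of the sigmoid is negated so that quality grows with the pixel bit-rate. The function names and the parameter values in the usage example are hypothetical and are not the trained values of Table 3.8.

```python
import math


def bits_per_pixel(bitrate_bps, frame_rate, width, height):
    """Eq.(3.21): pixel bit-rate from bit-rate, frame rate and frame size."""
    return bitrate_bps / (frame_rate * width * height)


def snr_vq(bpp, sa, ta, a):
    """Eq.(3.24) with a = (a0, ..., a5).

    Assumption: m = a2 * TA**a0 * SA**a1, i.e. the three factors of
    Eq.(3.23) are multiplied.
    """
    m = a[2] * ta ** a[0] * sa ** a[1]
    z = m * math.log(bpp) + a[3] * sa + a[4] * ta + a[5]
    return 100.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical parameters and activity values, for illustration only.
    params = (-0.2, -0.1, 3.0, 0.01, 0.01, 5.0)
    for rate in (200e3, 500e3, 2e6):
        bpp = bits_per_pixel(rate, 30, 704, 576)
        print(rate, snr_vq(bpp, sa=50.0, ta=10.0, a=params))
```

Using the pixel bit-rate as the input means that the same function can be evaluated for candidate frame rates and frame sizes at a fixed target bit-rate, which is what a multi-dimensional adaptation decision needs.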
If the prediction of a video quality metric ($PQ$) deviates too far from the subjective data (DMOS), then it is considered as an outlier:

\[ |\mathrm{DMOS}(k) - PQ(k)| > \frac{2 \, \sigma_{DMOS}(k)}{\sqrt{N_s}} \tag{3.26} \]

where $N_s$ is the number of test subjects, and $\sigma_{DMOS}$ denotes the standard deviation of the DMOS value over all $N_s$ subjects. The OR is then calculated as the ratio of the number of outliers $R_0$ to the total number of test videos in the analysis:

\[ OR = \frac{R_0}{N_v} \tag{3.27} \]

The performance of the proposed SNR quality metric (noted as MDVQM SNR) is evaluated and compared with three other objective metrics: PSNR, SSIM [WBSS04, WLB04] and

the SNR quality metric in VQMTQ as given in Eq.(3.1) (referred to as VQMTQ SNR).

Figure 3.8: Actual DMOS vs. predicted DMOS from the SNR quality models; panels: (a) MDVQM SNR, (b) VQMTQ SNR, (c) SSIM, (d) PSNR; axes: DMOS vs. Predicted DMOS

For a fair comparison, the PSNR and SSIM values are first fitted to the DMOS values measured from the subjective tests by the use of first order least-squares fitting. The relationship between the actual DMOS values and the predicted quality values from all four metrics is shown in Figure 3.8. It can be seen that the predicted quality values from PSNR are very inaccurate because content characteristics are neglected. The performance of SSIM is much better because the structural information of the video content is considered, but it is still not very satisfactory. In comparison, the predictions from MDVQM SNR and VQMTQ SNR are more linearly correlated with the subjective ratings. The statistical metrics for performance evaluation along with the corresponding 95% confidence intervals are given in Tables . The limits of the 95% confidence intervals are represented by the lower bound (LB) and upper bound (UB). The results show that in every aspect of the metric performance, MDVQM SNR provides better results than the comparison metrics.
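For readers who want to reproduce this kind of evaluation on their own data, the RMSE of Eq.(3.25) and the outlier ratio of Eqs.(3.26)-(3.27) can be computed as in the sketch below. The function names and the numbers in the usage example are hypothetical and are not taken from the subjective tests of this work.

```python
import math


def rmse(dmos, predicted, d):
    """Eq.(3.25): RMSE with d trained model parameters removed from the count."""
    n = len(dmos)
    squared = sum((a - b) ** 2 for a, b in zip(dmos, predicted))
    return math.sqrt(squared / (n - d))


def outlier_ratio(dmos, predicted, sigma, n_subjects):
    """Eqs.(3.26)-(3.27): share of videos whose error exceeds 2*sigma/sqrt(Ns)."""
    outliers = sum(
        1
        for a, b, s in zip(dmos, predicted, sigma)
        if abs(a - b) > 2.0 * s / math.sqrt(n_subjects)
    )
    return outliers / len(dmos)


if __name__ == "__main__":
    # Made-up ratings and predictions, for illustration only.
    dmos = [50.0, 60.0, 70.0, 80.0]
    predicted = [52.0, 58.0, 70.0, 90.0]
    print(rmse(dmos, predicted, d=2))
    print(outlier_ratio(dmos, predicted, sigma=[10.0] * 4, n_subjects=16))
```

Dividing by $N_v - d$ rather than $N_v$ penalizes metrics with more trained parameters, which matters when a six-parameter model is compared with a four-parameter one on a small dataset.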

To determine whether the performance of the metrics is significantly different from a statistical point of view, significance tests based on the F-test are performed. For example, if the result from the significance test between two metrics is 0.95, then it can be concluded with 95% confidence that the performance difference of the two comparison metrics is statistically significant. For more information about significance tests for video quality metrics, the readers can refer to [VQE00, PW08]. The results of the significance tests are also given together with the corresponding performance metrics in Tables . From the results, it can be seen that the statistical significance of the performance difference between MDVQM SNR and the comparison metrics is well above the 95% significance level for all three performance metrics.

Table 3.4: Pearson correlation values of the SNR quality metrics (columns: Metric, PC, PC LB, PC UB, Sig. Level; rows: PSNR, SSIM, VQMTQ SNR, MDVQM SNR)

Table 3.5: RMSE values of the SNR quality metrics (columns: Metric, RMSE, RMSE LB, RMSE UB, Sig. Level; rows: PSNR, SSIM, VQMTQ SNR, MDVQM SNR)

Table 3.6: Outlier ratios of the SNR quality metrics (columns: Metric, OR, CI, Sig. Level; rows: PSNR, SSIM, VQMTQ SNR, MDVQM SNR)

Another concern when evaluating a quality metric is the performance on unknown data. In this case, the data sets for training and validation should be separate. This is the way VQEG proceeds in their tests [VQE00, VQE03]. Compared with the subjective tests conducted by VQEG, a relatively small set of test sequences is used in our subjective tests. To overcome the problem of limited subjective data for training and verification of the quality metrics, cross

validation [Gei93, DK82] is used to evaluate the proposed metric for unknown data. There are different forms of cross validation, and the most widely used variant is the K-fold cross validation. In K-fold cross validation, the entire data set is divided into K subsets of equal size. From the K subsets, one is selected as the validation set and the other K - 1 subsets are used for training the metrics. This process is repeated K times, with each subset being used once as the validation data. In our validation, the leave-one-out cross validation (LOOCV) [Sto74], which is the special case of K-fold cross validation with K equal to the size of the entire dataset, is used. This means that each time one source sequence is excluded from the training data set and the metric parameters are trained using data from the other sequences. Afterwards, the excluded data are used for validation purposes. If the proposed metric works well for all the verification sequences, it is stable and accurate. To simplify the cross validation, 5 source sequences (HA, CR, PJ, OB, FB) with different characteristics in terms of motion and spatial details are first selected. They are always kept in the training data set. For the remaining 3 sequences (SC, PA, KO), one sequence is used each time for validation and the other two are used for training together with the 5 sequences above. The results of the cross validation are shown in Table 3.7. For comparison purposes, the validation result for VQMTQ SNR is also included. Note that TA and SA values from the processed sequences are used for MDVQM SNR, while for VQMTQ SNR, the features are extracted from the reference sequences.

Table 3.7: Cross validation result for the SNR metrics (columns: Test, Veri.Seq., PC and RMSE for MDVQM SNR, PC and RMSE for VQMTQ SNR; rows: Test1 Soccer, Test2 Kobe, Test3 Peda)

From the results, it can be seen that both metrics provide very stable and accurate predictions for the unknown data sets, with all PC values higher than . The proposed MDVQM SNR metric performs a little better, with smaller RMSE values. This is in accordance with the performance evaluation results in Table . The better performance of the proposed metric is due to the fact that in VQMTQ SNR, the drop rate of video quality against PSNR (which is the multiplier of the PSNR value) is modelled as a content-independent constant, which ignores the characteristics of the underlying videos. In fact, however, the video content has a masking effect on the perceived video quality. Video with high spatial or temporal details can hide the negative impact of encoding noise

(PSNR drop) to some extent, so that the perceived quality of such videos drops more slowly when the PSNR decreases. In the proposed MDVQM_SNR metric, the drop rate of video quality against the pixel bit-rate is modelled as a function of TA and SA, so that it adapts to the characteristics of the video content. The cost is that MDVQM_SNR needs two more model parameters than VQMTQ_SNR, but the better prediction performance justifies the increased complexity of the metric.

When subjective ratings from all the SRCs are used to train MDVQM_SNR, the obtained model parameters are given in Table 3.8a. For reference, the obtained model parameters for VQMTQ_SNR are given in Table 3.8b. They are used in the following sections for the estimation of temporal and spatial quality.

Table 3.8: Model parameters trained with all the subjective ratings
(a) Model parameters for MDVQM_SNR (Eq.(3.24)): a_0, a_1, a_2, a_3, a_4, a_5
(b) Model parameters for VQMTQ_SNR (Eq.(3.1)): p, b_0, b_1, b_2

Temporal Quality Metric

Apart from changing the QP values, another option for video adaptation is frame rate reduction. On the one hand, reducing the frame rate allows a smaller QP value to be used for the compression, so that the SNR quality of the encoded pictures can be improved. On the other hand, the resulting jitter/jerkiness artifacts impair the user experience. How frame rate reduction affects the overall perceived quality is hence a very practical question. In this section, the impact of frame rate reduction is studied and a temporal video quality metric is proposed to model this impact.

Design of the Temporal Quality Metric

As proposed in [OMLW11, PS11], the overall video quality in the presence of frame rate reduction is modelled as the product of two terms:

\[ TVQM = SNRVQ \cdot TCF \tag{3.28} \]

where SNRVQ models the SNR video quality without considering the impact of jerkiness introduced by frame rate reduction.
The second term, the Temporal Compensation Factor (TCF),

models the negative impact of frame rate reduction. Normally, the value of TCF is in the range [0,1]. For the estimation of SNRVQ, the metric in Eq.(3.24) can be used. Note that after frame rate reduction, the TA values of the adapted sequences (denoted TA_r) needed in Eq.(3.24) differ from those of the full-resolution sequences (denoted TA_f) due to the larger temporal distance between successive frames. Since the purpose is to use this metric for video adaptation, and only TA_f can be measured before any actual adaptation operation, TA_f is used in the estimation of SNRVQ for the temporal quality model. This may affect the accuracy of the prediction, so a correction process is introduced later in this subsection. The bpp value in Eq.(3.24) can be calculated as:

\[ bpp = bpp_0 \left( \frac{FR_{max}}{FR} \right) \tag{3.29} \]

where bpp_0 is the pixel bit-rate when the sequence is encoded at the target bit-rate but with full temporal resolution, FR_max is the original frame rate and FR is the actual frame rate after frame rate reduction. The actual video qualities of the test sequences are known from the results of the subjective test, and SNRVQ can be estimated using Eq.(3.24). In this way, the TCF is derived as:

\[ TCF = DMOS / SNRVQ \tag{3.30} \]

The TCF curves for different sequences are shown in Figure 3.9. The subplots correspond to different frame dropping ratios (from 2 to 4). It can be seen that the TCF is also a function of bpp. But as mentioned above, the TCF aims to model the impact of jerkiness caused by frame dropping, so it should not depend strongly on bpp. The reason for the observed dependency is that the TA values of the full-resolution sequences are used instead of those of the sequences after adaptation.
This can be explained with Figure 3.10. In Figure 3.10, the x-axis is the pixel bit-rate and the y-axis is the SNR video quality (which is subject only to quantization artifacts, without considering the impact of frame dropping). The solid blue curve is the RD-curve for the video sequence before adaptation (V_o) and the red dotted curve is the RD-curve for the video sequence after adaptation (V_a). The impact of frame dropping on the spatial details is very limited and can be neglected, so the spatial activity remains the same before and after the adaptation. But as discussed earlier, the larger temporal distance between successive frames after frame dropping results in a higher temporal activity. In this sense V_a is harder to compress than V_o, so at the same pixel bit-rate a coarser quantization is needed for V_a and its SNR video quality is lower. This is why the red curve in Figure 3.10 is always below the blue curve. Before the adaptation, the video is encoded at a pixel bit-rate of bpp_0, corresponding to point P1 in Figure 3.10. After the frame dropping, the available pixel bit-rate becomes the bpp calculated by Eq.(3.29). Since

the TA and SA values of the full-resolution video V_o are used for the estimation of the SNR video quality, the predicted value still corresponds to the point on the blue curve (P2). From P1 to P2, the bpp value increases because more bits can be used to encode the pixels in the non-skipped frames. However, the actual operating point should be on the red curve at the same pixel bit-rate bpp (P3 in Figure 3.10). If the blue curve is still used for the quality estimation, a pixel bit-rate slightly lower than bpp should be used (P4 in Figure 3.10).

Figure 3.9: TCF vs. bpp without correction. Panels: (a) Drop Ratio 4 (CR 15/60, HA 15/60, SC 15/60); (b) Drop Ratio 3 (PJ 20/60, PA 10/30); (c) Drop Ratio 2 (OB 15/30, KO 15/30, FB 15/30). Axes: bpp (bit/pixel) vs. TCF.

This means that if the TA and SA values of the full-resolution video V_o are used in Eq.(3.24) to estimate the SNR video quality in the case of frame dropping, then instead of using Eq.(3.29), bpp should be calculated as:

\[ bpp = bpp_0 \left( \frac{FR_{max}}{FR} \right)^{P_T} \tag{3.31} \]

where P_T is a content-dependent parameter within the range [0,1], and P_T increases as TA decreases. As an extreme case, when TA → 0 (which means a static scene), frame dropping no longer affects the spatial-temporal complexity. In this case, the red and blue curves in Figure 3.10 should coincide, which means P_T → 1. Therefore, in this work,

an exponential function of TA is used to estimate P_T:

\[ P_T = e^{-a_T \cdot TA} \tag{3.32} \]

where a_T is a model parameter which needs to be trained with subjective data.

Figure 3.10: Illustration of the bpp correction process

From Figure 3.9, several further conclusions can be drawn:

1. The decrease of the DMOS value is smaller for low-motion video content such as HA/OB (indicated by a higher TCF value) and larger for high-motion videos such as SC/KO/FB. This indicates that the impact of frame dropping is content-dependent and is stronger for videos with higher temporal activity.

2. At the same reduction ratio, the impact of frame dropping is different for 60fps and 30fps sequences. This can be observed in Figure 3.9b. PJ is a sequence with much higher motion than PA, but the TCF values of PJ are higher than those of PA. This suggests that frame dropping might have a more negative impact on low frame rate videos (e.g. PA with 30fps) than on high frame rate ones (e.g. PJ with 60fps). Although the reduction ratio is 3 in both cases (60fps to 20fps for PJ and 30fps to 10fps for PA), a frame rate of 10fps already causes an uncomfortable viewing experience, while a frame rate of 20fps is still acceptable for most viewers.

According to the two considerations above, our proposed model for TCF is given as:

\[ TCF = \frac{FR}{FR_{max}} \cdot \frac{1 + b_T \cdot FR_{max}/TA}{1 + b_T \cdot FR/TA} \tag{3.33} \]

where b_T is a model parameter which needs to be trained. It can be seen that when FR = FR_max or TA → 0, TCF reaches 1, which is in accordance with the fact that at full frame rate or for a static scene, the overall quality should be the same as the SNR quality.

The proposed temporal quality metric in Eq.(3.28) is trained again by using Eq.(3.24) and Eq.(3.31) to estimate SNRVQ and using Eq.(3.33) for the TCF. When the temporal quality metric is fitted to the subjective ratings from all 8 SRCs, the obtained model parameters are a_T = [...] and b_T = [...]. The TCF values can now be calculated again with Eq.(3.30), and the results are shown in Figure 3.11. The curves are much flatter, which means that the TCF can now be modelled independently of bpp. This justifies the functional form of the TCF model in Eq.(3.33).

Figure 3.11: TCF vs. bpp with correction. Panels: (a) Drop Ratio 4 (CR 15/60, HA 15/60, SC 15/60); (b) Drop Ratio 3 (PJ 20/60, PA 10/30); (c) Drop Ratio 2 (OB 15/30, KO 15/30, FB 15/30). Axes: bpp (bit/pixel) vs. TCF.
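The temporal model of Eqs.(3.31)-(3.33) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the negative exponent in P_T is inferred from the requirement that P_T lies in [0,1] and approaches 1 as TA → 0, the function names are hypothetical, and the parameter values used in the usage note are placeholders rather than the trained values. SNRVQ from Eq.(3.24) is not reproduced here and is passed in as a number.

```python
import math


def corrected_bpp_temporal(bpp0, fr_max, fr, ta, a_t):
    """Eqs.(3.31)-(3.32): pixel bit-rate after frame dropping.

    The exponent P_T = exp(-a_t * ta) is assumed to use a negative sign so
    that P_T stays in (0, 1] and tends to 1 for a static scene (ta -> 0).
    """
    p_t = math.exp(-a_t * ta)
    return bpp0 * (fr_max / fr) ** p_t


def tcf(fr, fr_max, ta, b_t):
    """Eq.(3.33): temporal factor in (0, 1]."""
    if ta <= 0:
        return 1.0  # limiting value for a static scene (TA -> 0)
    return (fr / fr_max) * (1 + b_t * fr_max / ta) / (1 + b_t * fr / ta)


def tvqm(snrvq, fr, fr_max, ta, b_t):
    """Eq.(3.28): overall quality as SNR quality times the temporal factor."""
    return snrvq * tcf(fr, fr_max, ta, b_t)
```

With the placeholder value b_t = 0.5, halving the frame rate from 30 to 15 fps gives tcf(15, 30, 5, 0.5) = 0.8 for a low-motion sequence (TA = 5) and about 0.64 for a high-motion one (TA = 20), which mirrors the content dependency observed in Figure 3.9.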

Performance Analysis of the Temporal Quality Metric

In this section, the performance of the proposed temporal video quality metric (referred to as MDVQM_TVQ) is evaluated in the presence of frame dropping. The metric is compared with two other metrics that also consider the impact of frame rate reduction:

- VQMTQ_TVQ: the metric in [OMLW11] (see Eqs.(3.1)-(3.3))
- STVQM_TVQ: the metric in [PS11] (see Eqs.(3.6)-(3.8))

For the modelling of the TCF, both MDVQM_TVQ and STVQM_TVQ use the feature TA extracted from the video sequence, and two model parameters need to be trained. In VQMTQ_TVQ, two features from the video (Motion Direction Activity (MDA) and Displaced Frame Difference (DFD)) are used and three parameters need to be trained from the subjective data. For MDVQM_TVQ, the TA value is calculated from the processed videos, while for STVQM_TVQ and VQMTQ_TVQ the required feature values are derived from the reference videos. For all the metrics, the model parameters are trained on the subjective ratings of the sequences in the SNR and TR groups obtained in our subjective tests.

To give an intuitive view of the estimation accuracy, Figure 3.12 illustrates the linear correlation between the predicted quality values and the actual DMOS values. The figures show that the predictions from all three quality metrics have a high linear correlation with the subjective ratings and that the proposed MDVQM_TVQ provides the best performance. These observations are confirmed and quantified by the statistical performance metrics given in Tables 3.9-3.11. MDVQM_TVQ outperforms the other two comparison metrics with a higher PC value and a smaller RMSE value. The results of the significance test show that the observed performance difference is statistically significant.

Table 3.9: Pearson correlation values of the temporal quality metrics
Metric        PC    LB_PC    UB_PC    Sig. Level
VQMTQ_TVQ
STVQM_TVQ
MDVQM_TVQ

Similar to the evaluation of the SNR quality metric, cross validation is performed to evaluate the metric on unknown data sets. Again, the sequences SC/KO/PA are used as verification sequences for the cross validation. Table 3.12 shows the results of the cross validation. The results confirm that the proposed temporal quality model also provides the best prediction performance for unknown data.
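The simplified cross validation used for the SNR and temporal metrics (five sequences always in the training set, three sequences rotated as verification data) can be sketched as follows. The function name is hypothetical and the sketch only enumerates the splits; fitting and scoring a metric on each split is left out.

```python
def rotating_validation_splits(fixed_train, rotating):
    """Return (training sequences, held-out sequence) pairs.

    Every split trains on all fixed sequences plus the rotating
    sequences that are not currently held out for verification.
    """
    splits = []
    for held_out in rotating:
        train = list(fixed_train) + [s for s in rotating if s != held_out]
        splits.append((train, held_out))
    return splits


# Sequence abbreviations as used in the text.
FIXED_TRAIN = ["HA", "CR", "PJ", "OB", "FB"]
VERIFICATION = ["SC", "KO", "PA"]

SPLITS = rotating_validation_splits(FIXED_TRAIN, VERIFICATION)
```

Each of the three splits trains on seven sequences and verifies on the remaining one, so every verification sequence is predicted by a model that never saw its ratings.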

Table 3.12: Cross validation results for the temporal quality metrics
Test     V.Seq.    VQMTQ_TVQ (PC, RMSE)    STVQM_TVQ (PC, RMSE)    MDVQM_TVQ (PC, RMSE)
Test1    Soccer
Test2    Kobe
Test3    Peda

Spatial Quality Model

The third option to adapt a video stream is to reduce the spatial resolution (spatial down-sampling). Reducing the frame size allows for a finer quantization because the amount of information to be compressed is reduced, so it can alleviate artifacts such as blocking and ringing. On the other hand, spatial down-sampling is an irreversible process and introduces blurring into the video if it is up-sampled to the original spatial resolution. In this section, the impact of spatial resolution on the perceived video quality is studied.

Design of the Spatial Quality Model

In [OXMW11], it was observed that the overall video quality with spatial down-sampling can be decomposed as:

\[ SVQM = SNRVQ \cdot SCF \tag{3.34} \]

where SNRVQ is the quality of a video which is subject only to quantization effects and SCF is a Spatial Correction Factor, which captures the impact of spatial down-sampling. Roughly speaking, SNRVQ represents the SNR quality when the video is down-sampled, encoded and displayed at the reduced spatial resolution without resampling back to the original size, so that no blurring effect is introduced. SCF then models the negative impact of the blurring effect introduced when the video is up-sampled and displayed at the original size. Similar to the case of frame rate reduction, when calculating SNRVQ using the model in Eq.(3.24), the SA values of the full-resolution (4CIF) sequences are used (because before the decision for video adaptation is made, spatial down-sampling has not been done yet and the SA value of the CIF sequence is unknown).
The pixel bit-rate after transcoding with spatial down-sampling can be calculated by:

\[ bpp = bpp_0 / SF \tag{3.35} \]

where bpp_0 is the pixel bit-rate when the video is encoded at the target bit-rate with the original resolution. SF denotes the spatial Scaling Factor, which is the ratio between the reduced and the original spatial resolution. In our case, SF = CIF/4CIF = 0.25.
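As a small worked example of Eq.(3.35), the following Python sketch computes the scaling factor and the pixel bit-rate before and after down-sampling. It assumes that the pixel bit-rate is the bit-rate divided by the number of pixels per second, which is an interpretation and is not spelled out in this section; the 1 Mbit/s target rate and 30 fps are illustrative values only. The frame sizes are the standard 4CIF (704x576) and CIF (352x288) dimensions.

```python
def pixel_bitrate(bitrate_bps, width, height, frame_rate):
    """Bits per pixel, assuming bpp = bit-rate / (pixels per second)."""
    return bitrate_bps / (width * height * frame_rate)


def bpp_after_downsampling(bpp0, sf):
    """Eq.(3.35): pixel bit-rate at the reduced spatial resolution."""
    return bpp0 / sf


# Standard frame sizes; the target rate and frame rate are made-up examples.
CIF = (352, 288)
FOUR_CIF = (704, 576)
SF = (CIF[0] * CIF[1]) / (FOUR_CIF[0] * FOUR_CIF[1])

BPP0 = pixel_bitrate(1_000_000, FOUR_CIF[0], FOUR_CIF[1], 30)
BPP_CIF = bpp_after_downsampling(BPP0, SF)
```

Because the same number of bits is spread over a quarter of the pixels, the CIF version has four times the pixel bit-rate of the 4CIF version at the same target rate.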

Inserting Eq.(3.35) into Eq.(3.24), the estimated SNRVQ in Eq.(3.34) can be obtained and SCF is derived as:

\[ SCF = DMOS / SNRVQ \tag{3.36} \]

Figure 3.13: SCF curves with/without correction. Panels: (a) Before Correction; (b) After Correction. Axes: bpp (bit/pixel) vs. SCF.

Figure 3.13a shows the SCF values obtained in this way, and it can be seen that they depend heavily on the pixel bit-rate. Similar to the design of the temporal quality metric, the SCF should be modelled independently of the bit-rate, so a correction to the bpp value is introduced for the estimation of SNRVQ:

\[ P_S = e^{-a_S \cdot SA}, \qquad bpp = bpp_0 \cdot (SF)^{-P_S} \tag{3.37} \]

where a_S is a model parameter which needs to be trained using the subjective test results. Figure 3.13a also shows that the SCF value is content-dependent. For content with higher spatial details, the SCF value is lower, indicating that the quality of this kind of video is affected more seriously by spatial down-sampling. Based on these observations, the proposed SCF model is given as:

\[ SCF = (SF)^{b_S \cdot SA} \tag{3.38} \]

where b_S is a model parameter which needs to be trained. It can be seen that when SF = 1 or SA → 0, SCF reaches 1, which indicates that without spatial down-sampling or for sequences with very few spatial details, the overall quality should be the same as the SNR quality.

Using the subjective ratings from all 8 SRCs, the two model parameters a_S and b_S are trained by least-squares non-linear fitting. The obtained values are a_S = [...] and b_S = [...]. Figure 3.13b shows the SCF values obtained with the correction given in Eq.(3.37). After the correction, the curves are quite flat at middle or high bit-rates (when

bpp is greater than 0.3 bits/pixel), indicating that the SCF is independent of the bit-rate. However, at the low bit-rate end, the curves become somewhat irregular. This can be attributed to the difficulty of rating videos when the quality is very low. For example, when there are very obvious artifacts in the video, it is hard to decide whether to give it a rating of 20 or 30, yet these ratings have a great impact on the obtained SCF values. It is assumed that the curves will become flat and regular when the number of test subjects is larger.

Performance Analysis of the Spatial Quality Model

In this section, the performance of the proposed spatial quality metric (referred to as MDVQM_SVQ) is evaluated and compared with that of three other quality metrics: PSNR, SSIM and the spatial quality metric proposed in [OXMW11] (see Eq.(3.4), referred to as QSTAR_SVQ). Figure 3.14 shows the linear relationship between the actual DMOS and the predicted quality values from the models in comparison. Comparing Figure 3.14c with Figure 3.8c, it can be seen that the SSIM index becomes less accurate when spatial down-sampling is performed. This indicates that, although down-sampling only introduces spatial artifacts into the video, SSIM alone is incapable of capturing the impact of spatial down-sampling on video quality. In comparison, the two metrics which explicitly model the impact of spatial resolution, i.e., MDVQM_SVQ and QSTAR_SVQ, provide much better quality prediction than SSIM and PSNR.

The statistical performance metrics and the results of the significance tests are summarized in Tables 3.13-3.15. All the performance metrics indicate that the proposed spatial quality metric provides the best prediction among the metrics in comparison. The difference in performance is statistically significant, as suggested by the significance test results.
Also, our SCF model requires only one video feature (the SA value) and two parameters that need to be trained from the subjective ratings, while the model QSTAR_SVQ requires four parameters. The better results of the proposed model come from the fact that the impact of spatial down-sampling depends on the characteristics of the video, which are considered in MDVQM_SVQ, whereas no video feature is considered in QSTAR_SVQ.

Table 3.13: Pearson correlation values of the spatial quality metrics
Metric        PC    LB_PC    UB_PC    Sig. Level
PSNR
SSIM
QSTAR_SVQ
MDVQM_SVQ
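The spatial model of Eqs.(3.37)-(3.38) can be sketched in the same style as the temporal one. This is a sketch under assumptions: the signs of the exponents are chosen so that the correction reduces to Eq.(3.35) when P_S = 1 and so that P_S lies in (0,1], the function names are hypothetical, and the parameter values in the usage note are placeholders, not the trained a_S and b_S.

```python
import math


def corrected_bpp_spatial(bpp0, sf, sa, a_s):
    """Eq.(3.37): pixel bit-rate used for SNRVQ after spatial down-sampling.

    P_S = exp(-a_s * sa) is assumed to lie in (0, 1]; for sa -> 0 the
    result equals the uncorrected value bpp0 / sf of Eq.(3.35).
    """
    p_s = math.exp(-a_s * sa)
    return bpp0 * sf ** (-p_s)


def scf(sf, sa, b_s):
    """Eq.(3.38): spatial correction factor, equal to 1 for sf = 1 or sa = 0."""
    return sf ** (b_s * sa)


def svqm(snrvq, sf, sa, b_s):
    """Eq.(3.34): overall quality as SNR quality times the spatial factor."""
    return snrvq * scf(sf, sa, b_s)
```

For SF = 0.25 and the placeholder b_s = 0.005, a sequence with SA = 20 keeps a larger share of its SNR quality than one with SA = 80, in line with the observation that detailed content suffers more from down-sampling.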

and SCF. If this assumption is adopted, the STCF can be derived from the TCF and SCF models proposed in the previous sections as:

\[ STCF = TCF \cdot SCF \tag{3.42} \]

where TCF and SCF can be calculated according to Eq.(3.33) and Eq.(3.38), respectively. To verify the accuracy of the above model, the subjective ratings from TEST II are used as the validation data set. For SNRVQ, TCF and SCF, the model parameters obtained in the previous sections are used. Figure 3.15(a) shows the linear relationship between the actual subjective ratings and the predicted DMOS values using the STCF model in Eq.(3.42) (referred to as PROD). The performance metrics of model PROD are given in the first lines of Tables 3.17-3.19. The results show that the prediction accuracy is not satisfactory. The RMSE value is high and the predicted values are often far below the actual values, as shown in Figure 3.15(a). Based on this observation, instead of directly using the product of TCF and SCF as in Eq.(3.42), three other models for the STCF are examined:

\[ STCF = \sqrt{TCF \cdot SCF} \tag{3.43} \]
\[ STCF = \max(TCF, SCF) \tag{3.44} \]
\[ STCF = \min(TCF, SCF) \tag{3.45} \]

and the overall video quality is then predicted by Eq.(3.41). Again, the subjective ratings from TEST II are used to validate these candidate models. The metric predictions and the actual DMOS values are shown in Figs.3.15(b)-(d). The performance metrics of the different models are given in Tables 3.17-3.19. The results show that the minimum function in Eq.(3.45) achieves a better overall performance than the other three STCF models, with a higher PC value and lower RMSE and OR values. The results of the significance tests (also shown in Tables 3.17-3.19) indicate that the performance difference between MIN and PROD is statistically significant, but that the statistical significance of the difference among SQRT, MAX and MIN is below the typical 95% significance level.
Further, the performance of the metrics is examined on each individual verification sequence, as shown in Table 3.20. Although the minimum function does not always perform best (e.g., the performance of SQRT and MAX is better for the sequence FOOTBALL), its performance is much more stable than that of the other models. Based on the above observation, the minimum function given in Eq.(3.45) is selected to calculate STCF in our overall spatial-temporal video quality metric MDVQM (see Eq.(3.41)). This indicates that when both TR and SR are reduced, the perceived video quality is mostly affected by the prevailing (more significant) distortion, either temporal or spatial.
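The four STCF candidates can be compared side by side with a short Python sketch. Eq.(3.41) is not reproduced in this section, so the sketch assumes that the overall quality is SNRVQ multiplied by STCF, by analogy with Eqs.(3.28) and (3.34); the names below are hypothetical.

```python
import math

# Candidate combinations of the temporal and spatial factors,
# keyed by the model names used in the text.
STCF_MODELS = {
    "PROD": lambda tcf, scf: tcf * scf,             # Eq.(3.42)
    "SQRT": lambda tcf, scf: math.sqrt(tcf * scf),  # Eq.(3.43)
    "MAX": lambda tcf, scf: max(tcf, scf),          # Eq.(3.44)
    "MIN": lambda tcf, scf: min(tcf, scf),          # Eq.(3.45)
}


def mdvqm(snrvq, tcf, scf, model="MIN"):
    """Overall quality, assuming Eq.(3.41) has the form SNRVQ * STCF."""
    return snrvq * STCF_MODELS[model](tcf, scf)
```

For factors in [0,1] the candidates are always ordered PROD <= MIN <= SQRT <= MAX. With TCF = 0.8 and SCF = 0.5, for example, they evaluate to 0.4, 0.5, about 0.63 and 0.8, which is consistent with the product model predicting values below the actual ratings.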

Table 3.18: RMSE values of the spatial-temporal quality metrics
Metric    RMSE    LB_RMSE    UB_RMSE    Sig. Level
PROD
SQRT
MAX
MIN

Table 3.19: Outlier ratios of the spatial-temporal quality metrics
Metric    OR    CI    Sig. Level
PROD
SQRT
MAX
MIN

Table 3.20: Performance of video quality metrics using different STCF models (Eqs.(3.42)-(3.45)) when both TR and SR are changed
Veri.Seq.    PROD (PC, RMSE)    SQRT (PC, RMSE)    MAX (PC, RMSE)    MIN (PC, RMSE)
PEDA
FOOT
RUSH

3.5 Summary

In this chapter, a no-reference objective video quality metric MDVQM is presented, which considers the impact of both spatial and temporal quality impairments on the overall perceived video quality. The metric is based on the pixel bit-rate, frame rate and frame resolution, as well as spatial and temporal video features (SA and TA values) that can be easily computed from the video sequences. Different from previous works, the situation in which frame rate and frame resolution change at the same time is also investigated. Verification with the data collected from our subjective tests shows that MDVQM provides accurate predictions of the perceptual video quality. The performance is significantly better than that of the comparison metrics.


Chapter 4 Improved ρ-domain Rate Control for H.264/AVC Video

In many video applications, compressed video streams are delivered under a certain rate restriction. Rate control (RC) therefore plays a very important role in meeting the rate requirement while maintaining good picture quality. H.264/AVC is a widely deployed international video coding standard. By utilizing many coding options such as variable block sizes, intra prediction, quarter-pel motion compensation and multiple reference frames, the coding efficiency is significantly improved. Compared with previous video coding standards (MPEG4 or H.263), a bit rate reduction of 50% can be achieved [WSBL03].

ρ-domain rate control [HM01, HM02a] has been shown to be simple and effective for DCT-based hybrid video codecs. When it is applied to H.264/AVC, improvements are needed because of the large amount of header information and the QP-dependent Rate Distortion Optimization (RDO). In this chapter, new rate models to estimate the size of the header information in H.264/AVC coded video streams are proposed. A two-stage rate control algorithm is presented which combines the proposed header rate model and the ρ-domain source model. In comparison with previous header rate models and rate control algorithms, the proposed approach improves the PSNR of the decoded video, meets the target bit rates more accurately and results in smaller quality fluctuation inside one frame.

The remainder of this chapter is organized as follows. A review of related work is given in Section 4.1. Section 4.2 presents the proposed rate control algorithm based on the ρ-domain model. Experimental results are presented and discussed in Section 4.3. Section 4.4 gives a summary of this chapter.

4.1 Related Work

Although rate control is not a normative part of any video coding standard, it is an essential part of video codecs used in practical applications. The purpose of rate control is to maximize the video quality under certain resource constraints (such as file size, transmission rate or delay). According to whether or not the instantaneous bit-rate is allowed to vary significantly, rate control can be classified into Variable Bit-Rate (VBR) algorithms and Constant Bit-Rate (CBR) algorithms. VBR algorithms have the flexibility to allocate more resources to more complex scenes within a sequence, so a better video quality can be achieved. However, there are many scenarios where the variation of the content complexity is not known or strict constraints are put on the instantaneous bit-rate. In these situations, CBR algorithms are used to ensure that the constraints are met. VBR algorithms are often used in storage applications where the video is consumed locally, and CBR algorithms are widely used in video streaming or video communication (telephony/conference) applications. In this work, considering the real-time video adaptation scenario, the focus is put on CBR situations.

Nowadays, H.264/AVC is the dominant video coding standard and many rate control algorithms have been proposed for it. Most of these algorithms are based on a certain functional relationship between the bit-rate and the encoding parameters (mainly the quantization parameters). The encoding process utilizes this function to adjust the encoding parameters in order to meet the target bit-rate. For example, in [MGWL03, MLW03], Li et al.
propose a rate control algorithm employing a quadratic model, which has been adopted by the Joint Video Team for its reference implementation of H.264/AVC (JM codec) [Tea]:

\[ \hat{D}_i = a \cdot D_{i-1} + b \tag{4.1} \]
\[ R = c \cdot \frac{\hat{D}}{QS} + d \cdot \frac{\hat{D}}{QS^2} + h \tag{4.2} \]

where \hat{D}_i is the predicted Mean Absolute Difference (MAD) of frame i. As shown in Eq.(4.1), \hat{D}_i is estimated using a linear function of the actual MAD value of frame i-1 (D_{i-1}). QS denotes the quantization step-size and h is the size of the header information. a, b, c and d are model parameters which need to be updated with the statistics of the encoded frames.

Another widely used open source implementation of H.264/AVC is X264 [X26]. In [MV07], the rate control algorithm for X264 is introduced. Before encoding a frame, motion estimation is performed on a half-resolution version of the frame and the Sum of Absolute Hadamard Transformed Differences (SATD) of the residual signal is calculated as a measure of frame complexity. The initial QP value is then determined empirically from this SATD value. During the encoding process, the QP values are updated for each macroblock (MB) according to the difference between the target frame size and the actual number of bits that have been generated.

The above algorithms do not utilize the frame statistics of the current frame, so they often suffer from relatively large errors in terms of rate control accuracy.
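To illustrate how the quadratic model of Eqs.(4.1)-(4.2) is used in practice, the following Python sketch predicts the MAD of a frame and solves Eq.(4.2) for the quantization step-size that meets a bit budget. It is a minimal sketch, not the JM implementation: the function names are hypothetical, the parameter values in the usage note are made up, and it assumes a positive predicted MAD with c >= 0 and d > 0 so that the quadratic has one positive root.

```python
import math


def predict_mad(prev_mad, a, b):
    """Eq.(4.1): linear prediction of the MAD of the current frame."""
    return a * prev_mad + b


def bits_for_step(mad, qs, c, d, h):
    """Eq.(4.2): estimated frame size for quantization step-size qs."""
    return c * mad / qs + d * mad / qs ** 2 + h


def step_for_target(target_bits, mad, c, d, h):
    """Invert Eq.(4.2): step-size whose estimated frame size equals the target.

    Substituting x = 1/QS gives d*mad*x^2 + c*mad*x - (target - h) = 0.
    """
    texture_bits = target_bits - h
    if texture_bits <= 0:
        raise ValueError("target does not even cover the header bits")
    if d == 0:
        return c * mad / texture_bits  # model degenerates to a linear one
    disc = (c * mad) ** 2 + 4 * d * mad * texture_bits
    x = (-c * mad + math.sqrt(disc)) / (2 * d * mad)
    return 1.0 / x
```

With the illustrative values mad = 4, c = 2000, d = 8000 and h = 300, a step-size of 10 is estimated to cost 1420 bits, and step_for_target(1420, 4, 2000, 8000, 300) recovers that step-size. In a real encoder the resulting step-size would still have to be mapped to the nearest valid QP.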

In [HM01, HM02a], He et al. observed that for DCT-based video coding (H.263/MPEG4), the coding bit-rate has a linear relationship with the percentage of coefficients which are quantized to zero:

\[ R(\rho) = \theta \cdot (1 - \rho) \tag{4.3} \]

where R denotes the coding bit-rate, ρ is the percentage of zero coefficients after quantization, and θ is a content-dependent constant. To use this relationship in rate control, a one-to-one mapping between ρ and the quantization parameter is needed. This mapping can be derived from the specific quantization scheme used in the video codec. Taking H.263 video coding [ITU05b] as an example, the quantized coefficients are calculated as:

\[ L = \begin{cases} \mathrm{Round}\left(\frac{COF}{8}\right) & \text{if COF is a DC coefficient in an intra-MB} \\ UTSQ(2q, 2q; COF) & \text{if COF is an AC coefficient in an intra-MB} \\ UTSQ(2q, 2.5q; COF) & \text{if COF is a coefficient in an inter-MB} \end{cases} \tag{4.4} \]

where COF denotes the unquantized transform coefficients and q is the quantization parameter. UTSQ denotes the Uniform Threshold Scalar Quantization:

\[ UTSQ[q, \delta; c] = \begin{cases} 0 & \text{if } |c| \le \delta \\ \left\lceil \frac{c - \delta}{q} \right\rceil & \text{if } c > +\delta \\ \left\lfloor \frac{c + \delta}{q} \right\rfloor & \text{if } c < -\delta \end{cases} \tag{4.5} \]

where δ is the dead-zone threshold. Then the relationship between ρ and q can be derived as:

\[ \rho(q) = \frac{1}{L} \sum_{|c| < 2q} H_I(c) + \frac{1}{L} \sum_{|c| < 2.5q} H_P(c) \tag{4.6} \]

where H_I(·) and H_P(·) are the histograms of the unquantized DCT coefficients for intra-coded and inter-coded MBs, respectively, and L is the number of coefficients in the current video frame. The ρ-domain rate control algorithm proposed in [HM02a] utilizes Eq.(4.3) and Eq.(4.6) as a rate model for the rate control of H.263 and MPEG-4 video codecs.

Compared with other algorithms, the ρ-domain rate model is very simple and can provide more accurate control of the coding bit-rate. However, when it is used for H.264/AVC video coding, several issues need to be resolved first. The first issue is the inter-dependency between RDO and rate control. In H.264/AVC, up to 7 block sizes are supported for motion estimation.
Small blocks improve the accuracy of motion estimation and reduce the energy of the residual signal, but lead to more motion

information (reference frame ID, motion vectors (MVs)). Therefore, a trade-off needs to be found. This is typically done in a rate-distortion optimized way:

\[ C = D + \lambda \cdot R_{mot} \tag{4.7} \]

where C is the cost of encoding the MB, D denotes the distortion (normally calculated as the Sum of Absolute Differences (SAD)), R_mot is the estimated size of the encoded motion information and λ is a Lagrange multiplier which depends on the choice of QP values [SW98]. Hence, the QP value is required by the rate-distortion optimized motion estimation. On the other hand, ρ-domain rate control determines the QP value based on the statistics of the transformed coefficients, which requires motion estimation to be performed before the selection of QP. This contradiction leads to the so-called chicken-and-egg dilemma.

The second issue comes from the increased amount of header information in H.264/AVC. By utilizing various coding options in H.264/AVC, the energy of the intra/inter prediction errors is significantly reduced. But at the same time, more bits are spent to signal these coding options (such as the block size for an inter MB and the intra prediction mode for an intra MB). At high bit-rates, the impact is not very serious since the texture information dominates the bitstream. But at low bit-rates, the header information occupies a large portion of the total bit-rate, which reduces the accuracy of ρ-domain rate control.

In [HW08], He and Wu propose to use the average QP value of the previous frame for the estimation of λ in Eq.(4.7) to break the inter-dependency between rate control and RDO, so that ρ-domain rate control can be used for H.264/AVC. The authors assume that the size of the header information is also proportional to ρ, so that the ρ-domain rate control originally proposed for H.263 can also be applied to H.264/AVC.
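The basic ρ-domain mechanism of Eqs.(4.3) and (4.6), on which these approaches build, can be sketched as follows for the H.263 quantizer. This is an illustrative sketch rather than the algorithm of [HM02a] or [HW08]: the function names are hypothetical, the coefficients are passed as plain lists instead of histograms, θ is assumed to be known, and the search range 1..31 is the H.263 quantizer range.

```python
def rho_of_q(coeffs_intra, coeffs_inter, q):
    """Eq.(4.6): fraction of coefficients that fall into the dead zone.

    coeffs_intra holds unquantized AC coefficients of intra MBs,
    coeffs_inter the unquantized coefficients of inter MBs.
    """
    total = len(coeffs_intra) + len(coeffs_inter)
    zeros = sum(1 for c in coeffs_intra if abs(c) < 2 * q)
    zeros += sum(1 for c in coeffs_inter if abs(c) < 2.5 * q)
    return zeros / total


def estimate_bits(rho, theta):
    """Eq.(4.3): linear rate model in the rho domain."""
    return theta * (1 - rho)


def pick_q(coeffs_intra, coeffs_inter, target_bits, theta, q_values=range(1, 32)):
    """Smallest q whose estimated rate does not exceed the target.

    rho never decreases as q grows, so the first hit is the finest
    quantizer that meets the budget; the coarsest q is the fallback.
    """
    for q in q_values:
        rho = rho_of_q(coeffs_intra, coeffs_inter, q)
        if estimate_bits(rho, theta) <= target_bits:
            return q
    return q_values[-1]
```

For the toy coefficients intra = [0, 3, -5, 12] and inter = [1, -2, 7, 30], q = 2 zeroes half of the coefficients, so a θ of 1000 bits predicts 500 bits, and a budget of 300 bits is first met at q = 3. Note that this sketch models texture bits only and does not account for header bits.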
However, as will be discussed in Section 4.2, this assumption is not always true, especially for low bit-rate cases.

In [KSK07], Kwon et al. propose a method to estimate the size of the motion information in H.264/AVC. The method is combined with the quadratic rate model in [MGWL03, MLW03], and the experimental results show that it performs better than the rate control method in JM8.1 [Tea], which uses the same source rate model.

In the following, a method to estimate the size of the header information in H.264/AVC is proposed. The proposed method is used together with the ρ-domain rate model to improve the accuracy of rate control for H.264/AVC codecs. A two-stage encoding structure is also employed to decouple rate control and RDO.

4.2 Proposed Rate Control Algorithm

In [HW08], the ρ-domain rate control algorithm is adapted for the H.264/AVC encoder. The authors claim that for H.264/AVC video coding, the total bit-rate consumed by a frame follows a similar linear relation with ρ. However, as has been mentioned, H.264/AVC introduces

several advanced prediction schemes which reduce the prediction error but also increase the size of the overhead information. Typically, this header overhead changes from frame to frame and is not addressed in the ρ-domain rate model.

Figure 4.1 shows the relationship between the percentage of non-zero coefficients (NNZs) and the size of a frame. For this experiment, the X264 encoder [X26] is used to encode the sequences FOREMAN and MOTHER&DAUGHTER with CAVLC (Context Adaptive Variable Length Coding). The sequences are encoded using QPs from 25 to 45, and the results for high and low bit-rates are shown separately: the QP values range from 25 to 33 for high bit-rates and from 34 to 45 for low bit-rates. The x-axis shows (1-ρ), the percentage of non-zero coefficients, and the y-axis shows the number of consumed bits. The red crosses show the size of a frame and the blue dots show the size of the texture (residual) information. It can be seen that although the size of the texture information is closely proportional to (1-ρ), the total size of a frame does not follow such a rule, especially at low bit-rates. The difference between the red and blue points is simply the size of the header information in each frame.

To make ρ-domain rate control more accurate for H.264/AVC, a precise estimate of the header size is therefore essential. The number of header bits changes significantly between frames, and it is hard to derive a closed-form mathematical model relating the number of header bits to the parameter ρ. In the following, selected observations from the experiments are presented and the size of the header information is estimated in an adaptive manner.

Header Information in H.264/AVC

Header information in H.264/AVC includes the NAL (Network Abstraction Layer) header, the sequence header (PPS and SPS), the slice header and the MB header.
The NAL header is rather small (1 byte per NAL unit). PPS and SPS are not sent very often. A single slice header is also very small, but when a frame is divided and encoded into multiple slices, the slice headers might occupy a certain percentage of the total frame size. Compared with the MB header, the size of a slice header is stable and easy to estimate. Normally, a slice contains either a constant number of MBs or a constant number of bits. The size of the slice headers can be estimated as:

$$R_{SH} = \begin{cases} \dfrac{N_{MB}}{N_{MBpS}}\, b_{SH} & \text{(fixed MB number)} \\[2ex] \dfrac{R_T}{N_{BpS}}\, b_{SH} & \text{(fixed slice size)} \end{cases} \tag{4.8}$$

where $R_{SH}$ is the estimated size of the slice headers of the current frame, $N_{MB}$ is the total number of MBs in the frame, $N_{MBpS}$ is the number of MBs per slice, $R_T$ is the total number of bits

allocated to the frame, $N_{BpS}$ is the number of bits allocated to each slice, and $b_{SH}$ is the average slice header size in the previous frames.

Rate Model for Inter MB Headers

The MB header is the most important header information. It contains the encoding parameters for inter MBs (such as the MVs and the reference frame IDs) and for intra MBs (such as the intra prediction type). Since the headers of different MB types contain different information, the numbers of header bits for inter and intra MBs need to be estimated separately.

In [KSK07], a linear rate model for the size of the header information in inter MBs is proposed. The authors claim that there is a strong relationship between the header size of inter MBs and the number of non-zero horizontal/vertical MV elements. The authors also consider that the size of the coded block pattern (CBP bits) has a strong relationship with the number of non-zero coefficients. Specifically, the size of the header information in inter MBs is modelled as:

$$R_{hdr,p} = \gamma \left( N_{nzMVe} + \omega N_{MV} \right) \tag{4.9}$$

where $\omega$ is fixed to 0.3 for single-frame motion estimation, $N_{nzMVe}$ is the number of non-zero motion vector elements, $N_{MV}$ is the total number of MVs, and $\gamma$ is a parameter to be estimated. The experiments in [KSK07] show that this model works well for many sequences, but our experiments show that it does not always predict the number of header bits accurately.

In our experiments, the video sequences (250 frames long) are encoded at different bit-rates using the original rate control algorithm of the X264 encoder. Figure 4.2 shows the result for the CIF sequences FOREMAN encoded at 512 kbps and Mother&Daughter (M&D) encoded at 384 kbps. The x-axis gives the value of $(N_{nzMVe} + \omega N_{MV})$ and the y-axis gives the size of the information. The blue points correspond to the total header size and the red crosses correspond to the size of the motion information.
It can be seen that for both the total header size and the motion information size, the predictions provided by the model in Eq. (4.9) do not correlate well with the actual values. This estimation error results from the fact that the H.264/AVC encoder applies CAVLC not directly to the MVs but to the differential MVs (MVDs), which are the differences between the actual MV and a predicted MV. The size of the motion information should therefore have a stronger linear relationship with the statistics of the MVDs. Let $N_{nzMVDe}$ and $N_{zMVDe}$ denote the number of non-zero and zero MVD elements, respectively (e.g. for the two MVDs (-1,0) and (0,0), $N_{nzMVDe} = 1$ and $N_{zMVDe} = 3$). A new rate model is proposed for the motion information which is similar to Eq. (4.9) but based on the statistics of the MVDs:

$$R_{mot,p} = \gamma_{mot} \left( N_{nzMVDe} + \omega_{mot} N_{zMVDe} \right) \tag{4.10}$$

where $\omega_{mot}$ is a fixed weighting factor and $\gamma_{mot}$ is a parameter to be estimated. It is observed from our experiments that $\omega_{mot} = 0.2$ works well for all the sequences when only one reference frame is used.

Figure 4.2: Performance of the MV-based rate model in Eq. (4.9) (two panels; x-axis: $N_{nzMVe} + \omega N_{MV}$; y-axis: size (bits); series: total header bits, motion bits)

In Figure 4.3, the relationship between $(N_{nzMVDe} + \omega_{mot} N_{zMVDe})$ and the total header size (blue points) as well as the size of the motion information (red crosses) is presented. From the results, it can be seen that the size of the motion information can be predicted very well using the proposed model. The estimation errors of the motion information size for different sequences are presented in Table 4.1 using the $R^2$ value [DF99]. The $R^2$ value between the actual values $y$ and the model predictions $x$ is calculated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{4.11}$$

where $\bar{y}$ is the mean value of the actual data in $y$ and $n$ is the size of the data set. The $R^2$ value measures the deviation between the predictions of a model and the actual data values. It is at most 1 and typically lies between 0 and 1; the better the prediction, the closer the $R^2$ value is to 1. From the results in Table 4.1, it can be seen that the proposed rate model for the motion information in Eq. (4.10) provides better predictions than the model in Eq. (4.9).

Another observation from the results is that, although the size of the motion information can be predicted quite accurately using the proposed model, the linear relationship between

$(N_{nzMVDe} + \omega_{mot} N_{zMVDe})$ and the total header size is much weaker. This estimation error results from the other header information in inter MBs.

Figure 4.3: Performance of the MVD-based rate model in Eq. (4.10) (two panels; x-axis: $N_{nzMVDe} + \omega_{mot} N_{zMVDe}$; y-axis: size (bits); series: total header bits, motion bits)

Table 4.1: Performance comparison of the two rate models for motion bits in Eq. (4.9) and Eq. (4.10). Columns: Seq., Bit-rate (kbps), $R^2$ value of the model in Eq. (4.9), $R^2$ value of the model in Eq. (4.10). Rows: M&D (CIF), Foreman (CIF), Football (CIF), Carphone (CIF).

As specified in H.264/AVC, an inter MB contains the following header information:
• Motion information (MVDs and Refs)
• MB type information (16x16, 16x8, 8x16, 8x8)
• Coded Block Pattern (CBP)
• QP value for the MB

To give an example, the average percentage of the different types of header information in inter MBs for the sequence Foreman (CIF, 30 fps, encoded at 512 kbps) is shown in Figure 4.4. From Figure 4.4, it can be seen that the size of the QP information is very limited compared

to other header information.

Figure 4.4: Distribution of bits in the MB header (x-axis: frame number; y-axis: percentage; series: motion info., type info., CBP info., QP info.)

The number of bits consumed by the QP information is therefore simply estimated using the average size of the QP information per MB in the previous frames ($b_{QP}$):

$$R_{qp,p} = N_p\, b_{QP} \tag{4.12}$$

where $N_p$ is the number of inter MBs in the current frame. The percentage of MB type information is a little higher. However, the number of MB types for an inter MB is limited (4 for the MB type and 4 for the sub-MB type) and the code sizes for each MB type or sub-MB type are fixed. Once the number of MBs of each type or sub-type is known, the size of this header information can be estimated by:

$$R_{type,p} = \sum_{i \in A} N_{p,i}\, b_{type,i} + \sum_{j \in B} N_{p8,j}\, b_{subtype,j} \tag{4.13}$$

$$A = \{P16{\times}16,\ P16{\times}8,\ P8{\times}16,\ P8{\times}8\}, \qquad B = \{D8{\times}8,\ D8{\times}4,\ D4{\times}8,\ D4{\times}4\}$$

where $N_{p,i}$ is the number of inter MBs of type $i$ (as given in set $A$), $N_{p8,j}$ is the number of 8x8 blocks of sub-block type $j$ (as given in set $B$), and $b_{type,i}$ and $b_{subtype,j}$ are the numbers of bits used for encoding the corresponding MB type and sub-MB type information for inter MBs and inter-8x8 blocks, respectively. $b_{type,i}$ and $b_{subtype,j}$ are fixed values specified in the standard. For example, one bit is used for encoding the P16×16 mode and three bits are used for encoding the P16×8 mode and the P8×16 mode.
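As a rough illustration of Eqs. (4.12) and (4.13), the following Python sketch adds up the type and QP bits for the inter MBs of one frame. The function names, the dictionaries and the mapping of MB types to unsigned Exp-Golomb code numbers are assumptions made for this example (chosen so that they reproduce the one-bit and three-bit code sizes quoted above); an actual encoder should take the code sizes from the tables in the standard.

```python
from typing import Dict

# Assumed code numbers for illustration only; they reproduce the 1-bit and
# 3-bit code sizes mentioned in the text.
MB_TYPE_CODE: Dict[str, int] = {"P16x16": 0, "P16x8": 1, "P8x16": 2, "P8x8": 3}
SUB_TYPE_CODE: Dict[str, int] = {"D8x8": 0, "D8x4": 1, "D4x8": 2, "D4x4": 3}


def ue_length(code_num: int) -> int:
    """Length in bits of the unsigned Exp-Golomb codeword for code_num."""
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def estimate_type_bits(mb_counts: Dict[str, int], sub_counts: Dict[str, int]) -> int:
    """Eq. (4.13): sum of (count x fixed code size) over MB and sub-MB types."""
    bits = sum(n * ue_length(MB_TYPE_CODE[t]) for t, n in mb_counts.items())
    bits += sum(n * ue_length(SUB_TYPE_CODE[t]) for t, n in sub_counts.items())
    return bits


def estimate_qp_bits(n_inter_mbs: int, avg_qp_bits_per_mb: float) -> float:
    """Eq. (4.12): number of inter MBs times the average QP bits per MB."""
    return n_inter_mbs * avg_qp_bits_per_mb
```

For example, a frame with ten P16×16, two P16×8, one P8×16 and one P8×8 MB, where the P8×8 MB is split into two D8×8, one D8×4 and one D4×8 block, would be charged 24 bits for the MB types and 8 bits for the sub-MB types under these assumptions.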

The percentage of the CBP information fluctuates heavily between frames, and for some frames it occupies a significant portion of the header size. As has been mentioned, the authors of [KSK07] consider that the size of the CBP information has a strong relationship with the size of the texture (residual) information. Based on ρ-domain theory, the size of the texture information has a strong linear relationship with the percentage of non-zero coefficients (1-ρ). This assumption can therefore be verified by observing the relationship between (1-ρ) and the size of the CBP information. Figure 4.5 shows the experimental results for the CIF sequence FOREMAN encoded using different QP values. The x-axis shows the percentage of non-zero coefficients in a frame (1-ρ) and the y-axis shows the size of the CBP information. It can be observed that the linear relationship is not as strong as expected. Predicting the number of CBP bits based on (1-ρ) can therefore introduce a large estimation error.

Figure 4.5: Relationship between the percentage of non-zero coefficients (1-ρ) and the size of the CBP information for the sequence Foreman (left: high bit-rate range, right: low bit-rate range; x-axis: 1-ρ; y-axis: CBP bits)

A possible explanation is that the number of CBP bits depends not only on the number of non-zero coefficients but also on the distribution of these coefficients. As an extreme example, assume that there are six non-zero coefficients in an MB. The number of CBP bits when these coefficients are evenly distributed over all six 8x8 blocks of the MB (four luma blocks and two chroma blocks) differs from the number of CBP bits when they all belong to one 8x8 block. Hence, to accurately estimate the size of the CBP information, the distribution of the non-zero coefficients should also be taken into account.
Let us define the zero MBs as the MBs in which all the quantized coefficients are zero. Then, inspired by the linear rate model for the motion information in [KSK07], the number of zero

and non-zero MBs can be used as an indicator of the distribution of non-zero coefficients. Let $N_{nzMB}$ and $N_{zMB}$ be the number of non-zero and zero MBs, respectively. A linear rate model for the CBP information is proposed as:

$$R_{cbp,p} = \gamma_{cbp} \left( N_{nzMB} + \omega_{cbp} N_{zMB} \right) \tag{4.14}$$

where $\omega_{cbp}$ and $\gamma_{cbp}$ play roles similar to those of $\omega_{mot}$ and $\gamma_{mot}$ in Eq. (4.10). Figure 4.6 shows the relationship between $(N_{nzMB} + \omega_{cbp} N_{zMB})$ (x-axis) and the size of the CBP information (y-axis). $\omega_{cbp}$ is set empirically to 0.1 based on the experiments.

Figure 4.6: Experimental results for Eq. (4.14) (two panels; x-axis: $N_{nzMB} + \omega_{cbp} N_{zMB}$; y-axis: CBP bits)

From Figure 4.6, it can be seen that there is a very strong linear relationship, as suggested by Eq. (4.14). The estimation errors for different test sequences, measured by the $R^2$ values, are shown in Table 4.2. For all the sequences, the $R^2$ values are very close to 1, suggesting that our linear rate model for the CBP information works well.

Table 4.2: Performance of the proposed rate model for CBP bits. Columns: Seq., Bit-rate (kbps), $R^2$ value of Eq. (4.14). Rows: M&D (CIF), Foreman (CIF), Football (CIF), Carphone (CIF).
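The linear models in Eqs. (4.10) and (4.14) and the $R^2$ measure in Eq. (4.11) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and fitting $\gamma$ by least squares through the origin is an assumption made here for the example, whereas the algorithm in this chapter updates the parameters adaptively during encoding.

```python
from typing import List, Sequence, Tuple


def count_mvd_elements(mvds: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (non-zero, zero) MVD element counts, as used in Eq. (4.10)."""
    nonzero = sum(1 for mvd in mvds for comp in mvd if comp != 0)
    return nonzero, 2 * len(mvds) - nonzero


def model_feature(n_nonzero: int, n_zero: int, omega: float) -> float:
    """The regressor N_nz + omega * N_z shared by Eqs. (4.10) and (4.14)."""
    return n_nonzero + omega * n_zero


def fit_gamma(features: Sequence[float], bits: Sequence[float]) -> float:
    """Least-squares slope through the origin (an assumption of this sketch)."""
    denom = sum(f * f for f in features)
    if denom == 0:
        raise ValueError("all features are zero")
    return sum(f * b for f, b in zip(features, bits)) / denom


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Eq. (4.11): 1 - sum((y - x)^2) / sum((y - mean(y))^2)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - x) ** 2 for y, x in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values have zero variance")
    return 1.0 - ss_res / ss_tot


# Synthetic per-frame counts (non-zero MBs, zero MBs), not measured data.
counts: List[Tuple[int, int]] = [(10, 89), (40, 59), (80, 19)]
features = [model_feature(nz, z, 0.1) for nz, z in counts]
bits = [4.0 * f for f in features]  # perfectly linear toy data
gamma = fit_gamma(features, bits)
score = r_squared(bits, [gamma * f for f in features])
```

With the MVD example from the text, `count_mvd_elements([(-1, 0), (0, 0)])` yields one non-zero and three zero elements. The toy data are exactly linear, so the fitted slope recovers the generating factor and the $R^2$ value is 1 up to rounding; real frame statistics would scatter around the line.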

The header bits for an inter MB can be modelled as:

$$R_{hdr,p} = R_{mot,p} + R_{qp,p} + R_{type,p} + R_{cbp,p} \tag{4.15}$$

where $R_{mot,p}$, $R_{qp,p}$, $R_{type,p}$ and $R_{cbp,p}$ are estimated using Eqs. (4.10), (4.12), (4.13) and (4.14), respectively.

Rate Model for Intra MB Headers

As there are very few intra MBs in a frame, especially at low bit-rates, and the size of the header information of intra MBs is limited compared to the size of the texture information in the same MB, the size of the header information of intra MBs in a frame is estimated by:

$$R_{hdr,i} = N_{i16\times16}\, b_{i16\times16} + N_{i4\times4}\, b_{i4\times4} \tag{4.16}$$

where $N_{i16\times16}$ and $N_{i4\times4}$ are the numbers of intra-16x16 MBs and intra-4x4 MBs, respectively, and $b_{i16\times16}$ and $b_{i4\times4}$ are the average sizes of the header information of intra-16x16 and intra-4x4 MBs in the previous frame.

Two-Stage Rate Control Algorithm

In this section, a two-stage rate control algorithm is proposed. In the first stage, motion estimation and mode decision are performed to collect the necessary statistics of the MBs in the current frame. In the second stage, the proposed rate models for the inter and intra MB header information are combined with ρ-domain rate control theory to accurately control the size of the current frame.

1. Frame Level Bit Allocation
Sophisticated frame level bit allocation algorithms could be used here. Since the main purpose is to verify the accuracy of the proposed rate models for header information, a simple frame level allocation method is used. The target size for a frame is determined by:

$$R_T = \frac{r}{F}$$

where $r$ is the target bit-rate of the video stream in bits/s and $F$ is the frame rate in frames/s (fps).

2. Stage One: Analysis Stage
In this stage, motion estimation and mode decision are conducted for all the MBs in the current frame using the average QP value of the previous frame in the RDO process. Then, the prediction residuals are transformed into the DCT domain for ρ-domain analysis.
After the analysis, the model parameters $N_{nzMVDe}$ and $N_{zMVDe}$ are counted.

3. Stage Two: Actual Encoding Stage

a) Before encoding each MB, the size of the header information, except the CBP information for inter MBs, is estimated for the remaining MBs using the rate models for inter and intra MB headers described above. The CBP information is excluded here because the number of zero MBs depends on the selected QP; it is estimated later, when candidate QPs are examined according to the ρ-domain rate control method.
b) The estimated header size is subtracted from the remaining bit budget $R_T$ of the current frame to determine the available bits $R_{avail}$. Then all possible QP values are examined. The size of the texture information $R_{tex}$ is estimated using the original ρ-domain rate control model and the size of the CBP information $R_{cbp}$ is estimated by Eq. (4.14). The smallest QP which results in $(R_{cbp} + R_{tex}) \le R_{avail}$ for the current MB is selected.
c) After encoding each MB, the bit budget $R_T$ is updated by subtracting the actual number of bits used for encoding the current MB. The parameters of the header rate model ($\gamma_{cbp}$, $\gamma_{mot}$, $b_{QP}$, $b_{i4\times4}$, $b_{i16\times16}$) and of the ρ-domain rate model ($\theta$) are updated accordingly.
d) The above procedure is repeated for all MBs in the current frame.

4. After encoding the current frame, the model parameters of the current frame are saved to be used for the first MB of the next frame. At the beginning of the first frame, default values ($\gamma_{cbp} = 4$, $\gamma_{mot} = 10.3$, $\theta = 5.4$) are used to initialize the model.

4.3 Experimental Results

The proposed rate control algorithm is implemented in X264, which is an open source implementation of H.264/AVC. The encoder is configured to conform to the baseline profile, and CAVLC is used for entropy coding. Extensive simulations are conducted using various standard test sequences. For each sequence, 250 frames are encoded. The first frame is encoded as an I frame and all following frames are encoded as P frames.
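Before turning to the results, the QP search in step 3b of the algorithm in Section 4.2 can be made concrete with a small Python sketch. It is a simplified illustration under stated assumptions, not the X264 implementation: the function names are hypothetical, the quantizer step size is approximated as doubling every six QP steps, a coefficient is treated as zero when its magnitude is below the step size (ignoring the rounding offset), and the texture bits are taken as $\theta$ times the number of non-zero coefficients.

```python
from typing import Sequence

OMEGA_CBP = 0.1  # weighting factor for zero MBs, as reported for Eq. (4.14)


def qstep(qp: int) -> float:
    """Approximate H.264/AVC quantizer step size (assumed closed form)."""
    return 0.625 * 2.0 ** (qp / 6.0)


def estimate_bits(mbs: Sequence[Sequence[float]], qp: int,
                  theta: float, gamma_cbp: float) -> float:
    """Estimated R_tex + R_cbp of the given MBs for one candidate QP."""
    step = qstep(qp)
    nonzero_coeffs = 0
    nonzero_mbs = 0
    for coeffs in mbs:
        # Assumption: |c| < Qstep quantizes to zero.
        nz = sum(1 for c in coeffs if abs(c) >= step)
        nonzero_coeffs += nz
        nonzero_mbs += 1 if nz else 0
    zero_mbs = len(mbs) - nonzero_mbs
    r_tex = theta * nonzero_coeffs                              # rho-domain texture model
    r_cbp = gamma_cbp * (nonzero_mbs + OMEGA_CBP * zero_mbs)    # Eq. (4.14)
    return r_tex + r_cbp


def select_qp(mbs: Sequence[Sequence[float]], r_avail: float,
              theta: float = 5.4, gamma_cbp: float = 4.0,
              qp_min: int = 0, qp_max: int = 51) -> int:
    """Smallest QP whose estimated texture plus CBP bits fit into r_avail."""
    for qp in range(qp_min, qp_max + 1):
        if estimate_bits(mbs, qp, theta, gamma_cbp) <= r_avail:
            return qp
    return qp_max  # budget cannot be met; fall back to the coarsest quantizer


# Toy transform coefficients for two remaining MBs (not measured data).
remaining_mbs = [[100.0, 40.0, 10.0, 0.0], [5.0, 2.0, 0.0, 0.0]]
chosen_qp = select_qp(remaining_mbs, r_avail=20.0)
```

Here `r_avail` stands for the bit budget that is left after the estimated non-CBP header bits have been subtracted. Because the estimated size never increases with QP in this sketch, scanning from the smallest QP upwards and stopping at the first fit returns the finest quantizer that respects the budget.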
For a fair comparison, the QP value for the I frame is determined in the same manner as in the original X264 rate control (CBR mode). As mentioned above, a simple frame level bit allocation which depends on the target bit-rate and the frame rate is employed in order to examine the accuracy of the header size prediction and rate control. The proposed algorithm is compared with three other rate control algorithms:
• X264: The original CBR mode rate control algorithm in X264
• ORIG: The original ρ-domain rate control algorithm without estimation of the header information size

• MVHE: ρ-domain based rate control algorithm in which the rate model in [KSK07] is used to estimate the size of the header information

Several performance metrics are used for the evaluation: video quality in PSNR, accuracy of the rate control, and QP fluctuation within one frame.

Video Quality in PSNR

Table 4.3: Performance comparison of the rate control algorithms. Columns: Seq., Target Bitrate (kbps), and BR (kbps) and PSNR (dB) for each of X264, ORIG, MVHE and Proposed. Rows: M&D (QCIF), Foreman (QCIF), M&D (CIF). PSNR gains over X264 (dB), in row order:
ORIG: (−0.06) (+0.05) (+0.03) (−0.02) (−0.07) (−0.01) (+0.15) (+0.13) (+0.20)
MVHE: (+0.07) (+0.18) (+0.18) (+0.16) (+0.13) (+0.11) (+0.07) (+0.17) (+0.22)
Proposed: (+0.08) (+0.19) (+0.20) (+0.29) (+0.22) (+0.21) (+0.18) (+0.22) (+0.27)

Table 4.3 shows a subset of the test results in terms of PSNR. The bit-rate of the encoded streams and the PSNR gain of the three ρ-domain rate control algorithms over the X264 rate control are also presented. It can be seen that for the original ρ-domain rate control, the PSNR gain is not always positive. The reason is that the X264 rate control uses a buffer to smooth the bit-rate changes between frames, so although the overall average bit-rate of the sequence is very close to the target value, the fluctuation of the frame sizes within the sequence is very large. This is discussed further below. For MVHE and the proposed algorithm, a positive PSNR gain is always achieved, and the gain of the proposed algorithm is larger than that of MVHE.

Accuracy of the Rate Control

According to Table 4.3, all the algorithms can control the average bit-rate quite accurately, with a control error smaller than 2%. But this is only one side of the story. As has been mentioned, the size of every single frame should also be controlled accurately. Figure 4.7 shows the size of the first 100 P-frames for the QCIF sequence FOREMAN encoded at 192kbps, 30fps (i.e. each frame should be encoded with 400 bytes). Figure 4.7a shows the comparison between the original ρ-domain rate control and the X264 rate control. It can be seen that the frame size fluctuation of X264 is much larger than that of the original ρ-domain rate control, which demonstrates the advantage of ρ-domain rate control. Further, Figure 4.7b shows the size of the encoded frames for the three ρ-domain algorithms. It can be seen that the size fluctuation of MVHE and the proposed algorithm is smaller than that of the original ρ-domain rate control. This improvement is due to the more accurate estimation of the header size. Table 4.4 presents the average deviation of the actual frame size from the target frame size. It can be seen that for all the test sequences, the proposed algorithm gives the smallest deviation, which means it controls the frame size most accurately.

Table 4.4: Average deviation of the actual frame size from the target frame size. Columns: Seq., Target Bitrate (kbps), Target Frame Size (Byte), and Dev. (Byte) and Percent (%) for each of ORIG, MVHE and Proposed. Rows: Foreman (QCIF), M&D (CIF), Football (CIF), Foreman (CIF).

QP Fluctuation among MBs within a Frame

All the MB level rate control algorithms allow the QP to be adjusted for each MB in order to meet the target frame size accurately. On the other hand, if the QP changes too frequently, more bits need to be spent to signal the QP changes between successive MBs in the bitstream.
Also, a higher QP variation causes more noticeable quality fluctuations within a frame, which can be visually annoying. Figure 4.8 shows an example of the QP variation within one frame for the three ρ-domain rate control algorithms. It can be seen that the proposed algorithm (green

line with star points) results in a much smaller QP variation within the frame compared with the original ρ-domain rate control (blue line with diamond points) and MVHE (red line with round points).

Figure 4.8: Comparison of QP fluctuation within one frame (two panels for Foreman (QCIF), one of them titled F50; x-axis: MB index; y-axis: QP; series: orig, MVHE, proposed)

Table 4.5 shows the average variance of the QP values within a frame. The average maximum difference between the QP values within a frame (i.e. the difference between the largest and the smallest QP within a frame) is also given. The results show that in most cases, the proposed algorithm results in a smaller variance and a smaller maximum difference, which again indicates a smaller fluctuation of the QP values within a frame. The smaller variation allows more bits to be spent on the residual signal, which improves the picture quality. It also indicates that the proposed algorithm predicts the total size of the header and residual information more accurately than the other two algorithms, so that an appropriate QP is selected from the beginning and does not need to be changed dramatically for the last MBs to meet the target frame size.

In summary, compared with the other rate control algorithms, the proposed algorithm gives the best video quality, the smallest frame size control error and the smallest QP variation within a frame. This also supports the effectiveness of the proposed header estimation method.

4.4 Summary

In this chapter, an efficient rate control algorithm for H.264/AVC with accurate header information estimation is proposed. The approach uses a two-stage encoder structure to resolve the inter-dependency between RDO and ρ-domain rate control. The header information is estimated using an improved rate model which considers the different components of an MB header (type information, motion information, CBP information, etc.).
Experimental results show the proposed algorithm can achieve better rate control accuracy and video quality compared

Chapter 5 QoE-driven Multi-Dimensional Video Adaptation

In Chapter 3 and Chapter 4, the two most important components of the video adaptation system shown in Figure 1.2, i.e. perceptual quality estimation and rate control, are studied. Based on these studies, a QoE-driven MDA scheme is proposed in this chapter to determine the optimal combination of different adaptation operations and to optimize the perceptual video quality. The QoE is estimated based on the objective quality metric MDVQM presented in Chapter 3, which has been shown to provide a good estimation of the perceptual quality of videos in the presence of both spatial and temporal impairments. The presented QoE-driven video adaptation scheme automatically examines the impact of different adaptation strategies and selects the strategy with the best estimated quality. The ρ-domain rate control algorithm presented in Chapter 4 is also integrated into the system for accurate rate adaptation.

5.1 System Overview

As mentioned in Chapter 3, three major parameters can be adjusted to perform video adaptation: the quantization parameter, the temporal resolution of the video (frame rate) and the spatial resolution of the video (frame size). They affect the SNR, temporal and spatial quality of the video, respectively. The contribution of the three quality measures to the overall perceived video quality depends heavily on the characteristics of the video content. The aim of a QoE-driven video adaptation scheme is to find a compromise among the different quality measures that maximizes the perceived video quality (QoE) under constrained system resources. Therefore, in the proposed video adaptation scheme, three operating modes are considered:

• SNR Mode: The video adaptation algorithm does not change the spatial and temporal resolution of the video. Rate adaptation is performed by adjusting only the QP values.
• Temporal Mode (T-Mode): The video adaptation algorithm reduces the frame rate to maintain a good SNR quality of the encoded frames. For T-Mode, there are different sub-modes corresponding to different ratios of frame rate reduction (e.g. reducing the frame rate to 1/2, 1/3 or 1/4 of the original frame rate in our demo implementation). The sub-mode providing the best quality is selected for T-Mode.
• Spatial Mode (S-Mode): The video adaptation algorithm reduces the spatial resolution of the video. For simplicity, the down-sampling factor is fixed to 2 both horizontally and vertically in our demo implementation, so that the frame size is 1/4 of the original size.

Figure 5.1 shows the workflow of the proposed adaptation scheme. The incoming video stream is decoded and a scene change detector is applied to the decoded frames. In the presence of a scene change, the first frame after the scene change is encoded as an intra frame and the current adaptation mode is kept unchanged (left path in Figure 5.1). Otherwise, the normal mode decision process is conducted as follows (right path in Figure 5.1). The decoded frame is first used to calculate the spatial and temporal complexity measures (TA/SA) required by the video quality metric (see Eqs. (3.9) and (3.10)). To base the decision on more than a single frame, the TA and SA values are averaged over a window of 5 frames. The mean TA/SA values are used to estimate the resulting video quality as proposed in Chapter 3. Note that if there is a scene change, the mean TA/SA values are reset. This is because different scenes have quite different spatial and temporal characteristics, and resetting the TA/SA values gives a more accurate estimation for the current scene.
Then, the elapsed time since the last adaptation mode change is compared with a threshold waiting time (referred to as $T_{wait}$). $T_{wait}$ is the minimum duration the algorithm needs to wait before it is allowed to change the adaptation mode again. The purpose of introducing $T_{wait}$ is to avoid too frequent mode changes. The considerations are twofold. Firstly, jumping between different adaptation modes may cause jitter in the frame quality and affect the user experience. Secondly, a change of the spatial resolution requires resetting the encoder status, and the first frame after this kind of adaptation mode change must be encoded as an intra frame. If this happens too frequently, the number of intra frames increases and the coding efficiency of the adapted video stream suffers. In the implementation, a waiting period of 1 second is used. If the elapsed time is longer than $T_{wait}$, quality estimation and mode decision are conducted to select the best adaptation mode. Otherwise, the current adaptation mode is kept unchanged.
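The waiting-time check can be sketched together with the confirmation rule of the scheme, which requires 5 successive frames to indicate the same optimal mode before a switch. The class below is a minimal illustration, not the prototype's code: the class and attribute names and the mode labels "SNR", "T" and "S" are hypothetical, and resetting the confirmation counter while the waiting time has not yet elapsed is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ModeSwitcher:
    current_mode: str = "SNR"
    t_wait: float = 1.0                 # minimum time between mode changes (s)
    confirm_frames: int = 5             # agreeing frames needed before a switch
    last_change_time: float = float("-inf")
    _candidate: str = ""
    _streak: int = 0

    def update(self, now: float, best_mode: str) -> str:
        """Feed the per-frame best mode; return the mode to use for this frame."""
        if now - self.last_change_time <= self.t_wait:
            # Still inside the waiting period: keep the mode, skip the decision.
            self._candidate, self._streak = "", 0
            return self.current_mode
        if best_mode == self.current_mode:
            self._candidate, self._streak = "", 0
            return self.current_mode
        if best_mode == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = best_mode, 1
        if self._streak >= self.confirm_frames:
            # Enough successive frames agree: switch and restart the timer.
            self.current_mode = best_mode
            self.last_change_time = now
            self._candidate, self._streak = "", 0
        return self.current_mode


switcher = ModeSwitcher()
# Five successive 30 fps frames vote for the temporal mode.
modes = [switcher.update(i / 30.0, "T") for i in range(5)]
```

In this example the first four frames still use the SNR mode and the fifth frame triggers the switch to the temporal mode; any different vote during the following second is ignored because of the waiting period.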

For the quality estimation, the mean TA/SA values are passed to the proposed quality metric in Section 3.4. The estimated quality values for the sub-modes of the temporal mode are first calculated and compared with each other to determine the optimal frame rate for the temporal mode (Section 3.4.2). The estimated quality value for the temporal mode (referred to as $TVQ_{best}$) is then compared with the estimated video quality for the spatial mode (SVQ, Section 3.4.3) and for the SNR mode (SNRVQ, Section 3.4.1) to determine the optimal adaptation mode. The adaptation mode is changed only if 5 successive frames indicate the same optimal adaptation mode. The purpose is again to avoid frequent mode changes, as mentioned earlier. After the mode decision process described above, the adaptation mode for the current frame is determined and the encoder re-encodes the current frame according to the decision.

5.2 Performance Evaluation of the Video Adaptation Scheme

To evaluate the performance of the QoE-driven video adaptation algorithm, a prototype system with a cascaded transcoding architecture is implemented based on the open source X264 video codec. Since the spatial adaptation mode changes the resolution of the frames, image scaling needs to be implemented to down- and up-sample the frames. In our implementation, the image scaling algorithm proposed in [Rie] is adopted. For image down-sampling, it simply calculates the average of four adjacent source pixels to obtain the target pixel value. For image up-sampling, it uses a directional interpolation method, which performs the interpolation according to the direction of the gradient at each pixel, so that the interpolation is done along the edge instead of across the edge. This avoids the typical strong blurring artifacts introduced by bi-linear interpolation.
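The down-sampling step described above can be illustrated with a short Python sketch. It is not the implementation from [Rie]: the function name is hypothetical, and it is assumed here that the four adjacent source pixels form a 2×2 block, that the average is rounded to the nearest integer, and that a trailing odd row or column is dropped.

```python
from typing import List


def downsample_2x2(frame: List[List[int]]) -> List[List[int]]:
    """Halve width and height by averaging each 2x2 block of pixel values."""
    height, width = len(frame), len(frame[0])
    out: List[List[int]] = []
    for y in range(0, height - 1, 2):
        row: List[int] = []
        for x in range(0, width - 1, 2):
            total = (frame[y][x] + frame[y][x + 1]
                     + frame[y + 1][x] + frame[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded integer average (assumption)
        out.append(row)
    return out


# A 2x4 toy luma plane becomes a 1x2 plane.
small = downsample_2x2([[10, 20, 30, 40],
                        [10, 20, 30, 40]])
```

Each output pixel covers four source pixels, which matches the factor-2 down-sampling in both directions used by the spatial mode, so the resulting frame has 1/4 of the original number of pixels.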
Also, the rate control algorithm of the original X264 codec is replaced with the ρ-domain rate control algorithm introduced in Chapter 4 to achieve a better accuracy for rate adaptation.

The developed transcoder prototype is used to adapt videos containing different types of content. In the following, three test videos are selected to test the performance of the demo system.
• The first one is a combination of 5 standard test sequences (PEDESTRIAN, OBAMA, RUSH HOUR, FOOTBALL, KOBE) with a frame rate of 30fps (referred to as STANDARD 30fps). Each sub-scene contains 300 frames (corresponding to a duration of 10 seconds), so the total length is 50 seconds.
• The second one is a combination of 5 standard test sequences (HARBOUR, SOCCER,

PARKJOY, SHIELDS, CROWD RUN) with a frame rate of 60fps (referred to as STANDARD 60fps). Each sub-scene contains 500 frames.
• The third test sequence is a video clip of BBC sport news downloaded from YouTube (referred to as BBCNEWS 30fps). The frame rate of this video is 30fps. The sub-scenes in this video are of different durations.

Example scenes from the test videos are shown in Figure 5.2. The scenes are arranged according to their order in the test videos.

Figure 5.2: Example frames of test sequences ((a) STANDARD 30fps, (b) STANDARD 60fps, (c) BBCNEWS 30fps)

These videos are first encoded at a relatively high bit-rate so that the output video has a very good quality (which avoids an impact of the source video quality on the performance evaluation of the adaptation scheme). The developed transcoder is then used to adapt the videos to a relatively low bit-rate. The encoding bit-rates of the input and output video streams are summarized in Table 5.1. Adapted videos employing four different adaptation strategies are generated for comparison:
• Proposed adaptive mode decision scheme, which adaptively selects the best video adaptation mode.

• SNR-only scheme, which only uses SNR-mode for adaptation.
• TR-only scheme, which only uses T-mode for adaptation.
• SR-only scheme, which only uses S-mode for adaptation.

Table 5.1: Encoding bit-rates of the input and output video streams for performance evaluation

test video        input bitrate (kbps)    output bitrate (kbps)
                                          BR1    BR2    BR3    BR4
STANDARD 30fps
STANDARD 60fps
BBCNEWS 30fps

To compare the performance of different adaptation schemes, the quality improvement is measured quantitatively as:

\[ IR = \frac{VQ_{QoE}}{VQ_{na}} - 1 \qquad (5.1) \]

where VQ_QoE is the resulting mean video quality of the proposed QoE-driven adaptation scheme and VQ_na is the mean video quality when one of the three non-adaptive schemes is used. The video quality is estimated using the video quality metric MDVQM proposed in Chapter 3.

Tables show the quality improvement of the proposed adaptation scheme against the non-adaptive strategies when adapting the original video stream to different lower bit-rates. The improvement is measured in two ways. The first is the improvement of the mean quality value over the whole video, referred to as the overall quality improvement. The second is the improvement of the mean video quality over the periods in which the two schemes being compared make different decisions. For example, when the proposed adaptive scheme is compared with the SNR-only scheme, only the parts of the video where the algorithm uses an adaptation mode other than SNR mode are considered. In the tables, this is referred to as the optimized part.

From the results, it can be seen that compared with the conventional SNR-only scheme (which changes only the QP for video adaptation), the proposed QoE-driven adaptive scheme achieves an overall quality improvement of up to 10%. Also, the overall improvement of video quality decreases as the target adaptation bit-rate gets higher.
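As a hedged illustration of the two measurements described above, the sketch below computes the overall improvement and the optimized-part improvement from per-frame quality values. It assumes that Equation (5.1) reads IR = VQ_QoE / VQ_na - 1; the function names, the per-frame list layout and the numbers in the example are hypothetical and are not results from the thesis.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def improvement_ratio(vq_qoe, vq_na):
    """Relative improvement, assuming Eq. (5.1) is IR = VQ_QoE / VQ_na - 1."""
    return vq_qoe / vq_na - 1.0


def quality_improvement(q_adaptive, q_baseline, modes_adaptive, baseline_mode):
    """Return (overall IR, optimized-part IR) for one comparison scheme.

    q_adaptive / q_baseline: per-frame quality estimates of the two schemes.
    modes_adaptive: mode chosen by the adaptive scheme for each frame.
    baseline_mode: the single mode used by the comparison scheme.
    """
    overall = improvement_ratio(mean(q_adaptive), mean(q_baseline))
    # Optimized part: only frames where the adaptive scheme chose another mode.
    idx = [i for i, m in enumerate(modes_adaptive) if m != baseline_mode]
    if not idx:
        return overall, 0.0  # both schemes made identical decisions
    optimized = improvement_ratio(mean([q_adaptive[i] for i in idx]),
                                  mean([q_baseline[i] for i in idx]))
    return overall, optimized


if __name__ == "__main__":
    # Hypothetical per-frame values, for illustration only.
    overall, optimized = quality_improvement(
        [4.0, 4.4, 4.4, 4.0], [4.0, 4.0, 4.0, 4.0],
        ["SNR", "S", "S", "SNR"], "SNR")
    print(round(overall, 3), round(optimized, 3))
```

Because the optimized part excludes frames where both schemes make the same decision, its improvement is at least as large in magnitude as the overall figure whenever the quality gain is confined to those frames.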
The decrease at higher bit-rates is easy to understand: when the temporal/spatial complexity of the video content is low or the target bit rate is relatively high, there is no need for any adaptation operation other than changing the QP. In this case, the overall quality improvement against

only (Figure 5.3b) and spatial mode only (Figure 5.3c). The red curves show the quality over time when the QoE-driven adaptation scheme is used and the blue curves show the quality when the other strategies are used. Sometimes the curves overlap because the QoE-driven adaptation scheme has chosen the same adaptation mode as the comparison scheme. The red curves are always higher than or overlap with the blue curves, which indicates that the proposed QoE-driven adaptation scheme adaptively makes the best adaptation decision to optimize the video quality. Figure 5.4 and Figure 5.5 show the change of video quality over time for the other two test videos (i.e. STANDARD 60fps and BBCNEWS 30fps). In Figure 5.3 and Figure 5.4, the periods of the different sub-scenes contained in the test videos are marked with example pictures of the sub-scenes to make the figures more intuitive.

From the results, it can be seen that for 30fps sequences, the temporal mode is not used very often. Most of the time, the transcoder operates in SNR or spatial mode. The reason is that 30fps is already a threshold below which the human visual system tends to notice the jerkiness caused by frame dropping, so reducing the frame rate from 30fps to 15fps or lower is not a preferred way of adapting the video. This behavior of the transcoder agrees with the results of the subjective tests. On the other hand, for 60fps sequences, since the original frame rate is relatively high, the impact of reducing the frame rate from 60fps to 30fps is not very significant, so the temporal mode is used more often than for the 30fps sequences.
This observation is also confirmed by the tabulated results: for the two 30fps test videos, the TR-only scheme performs much worse than the SNR-only scheme (as shown by its much higher overall quality loss against the proposed scheme). However, according to the results for the test video STANDARD 60fps (Table 5.4), the overall performance of the TR-only scheme is very close to that of the SNR-only scheme, and sometimes even better, especially at low bit-rates. This indicates that for videos with a relatively high frame rate (e.g. 50/60fps), the temporal mode (reducing the frame rate) can be a preferred choice for adaptation.

Figure 5.6a shows sample frames from BBCNEWS 30fps adapted using the proposed adaptive QoE-driven scheme (right) and the conventional SNR-only scheme (left). For this frame, the adaptive scheme chooses to encode the frame in SR-mode (which means the frame is down-sampled and then encoded at the lower resolution), while the SNR-only scheme simply adjusts the quantization parameter. The achieved quality improvement can be seen clearly. In Figure 5.6b and Figure 5.6c, comparisons of adapted frames for the test videos STANDARD 30fps and STANDARD 60fps are also presented. Similar quality improvements can be observed in these pictures.

The above results have shown the effectiveness of the proposed QoE-driven adaptation scheme.

5.3 Summary

In this chapter, a QoE-driven video adaptation scheme is proposed to adaptively make the optimal adaptation decision according to the characteristics of the video content and the channel resources. Three adaptation modes, i.e. SNR Mode, Temporal Mode and Spatial Mode, are considered in the proposed scheme. The metric MDVQM proposed in Chapter 3 is used to estimate the perceived video quality and select suitable adaptation operations. The frequency of mode changes is restricted to avoid spatial and temporal flickering effects. The performance of the proposed scheme is evaluated using video sequences with different characteristics. The results show that the proposed adaptive MDA scheme outperforms all the non-adaptive approaches. Compared with the traditional adaptation scheme where only the QP is adjusted, the improvement of the overall perceived video quality is up to 10%.
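The mode decision summarized in this chapter (pick the best temporal sub-mode, compare it with the spatial and SNR modes, and switch only after 5 successive frames agree) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the thesis implementation: the class name `ModeSelector`, the mode labels, the tie-breaking rule and the quality inputs are hypothetical, and the quality values would in practice come from MDVQM.

```python
class ModeSelector:
    """Per-frame adaptation mode choice with a switch-delay rule.

    A mode change takes effect only after `hold_frames` successive frames
    indicate the same new optimal mode (5 in the scheme described above).
    """

    def __init__(self, initial_mode="SNR", hold_frames=5):
        self.mode = initial_mode
        self.hold_frames = hold_frames
        self._candidate = None  # mode currently accumulating a streak
        self._streak = 0

    def update(self, snr_vq, tvq_by_rate, svq):
        """Feed one frame's estimated qualities and return the active mode.

        tvq_by_rate maps each temporal sub-mode (frame rate) to its
        estimated quality; it must be non-empty.
        """
        tvq_best = max(tvq_by_rate.values())  # best temporal sub-mode
        scores = {"SNR": snr_vq, "T": tvq_best, "S": svq}
        # Assumption: ties resolve in the order SNR, T, S.
        best = max(scores, key=scores.get)

        if best == self.mode:
            self._candidate, self._streak = None, 0
        elif best == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = best, 1

        if self._streak >= self.hold_frames:
            self.mode = self._candidate
            self._candidate, self._streak = None, 0
        return self.mode


if __name__ == "__main__":
    sel = ModeSelector()
    # Hypothetical quality values: spatial mode looks best on every frame.
    for _ in range(5):
        active = sel.update(snr_vq=3.0, tvq_by_rate={15: 2.5, 30: 2.8}, svq=3.6)
    print(active)
```

Resetting the streak whenever the current mode is again optimal means a single contradicting frame restarts the count, which matches the requirement of successive agreement.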
