A HYBRID TRANSFORMATION TECHNIQUE FOR ADVANCED VIDEO CODING
M. Ezhilarasan, P. Thambidurai
Department of Computer Science & Engineering and Information Technology, Pondicherry Engineering College, Pondicherry – 605 014, India
[email protected]
ABSTRACT
A video encoder compresses video data by combining three main modules: motion estimation and compensation, transformation, and entropy encoding. Among these, the transformation module removes the spatial redundancy that exists in the spatial domain of a video sequence. The Discrete Cosine Transform (DCT) is the de facto transformation method in existing image and video coding standards. Even though the DCT has very good energy preserving and decorrelation properties, it suffers from blocking artifacts. To overcome this problem, a hybridization method has been incorporated in the transformation module of the video encoder. This paper presents a hybridization of the transformation module that uses DCT as the transformation technique for inter frames and a combination of wavelet filters for intra frames of a video sequence. The proposal is also applied in the existing H.264/AVC standard. Extensive experiments have been conducted with various standard CIF and QCIF video sequences. The results show that the proposed hybrid transformation technique outperforms the existing technique used in H.264/AVC considerably.
Keywords: Data Compression, DCT, DWT, Video Coding, Transformation.
1 INTRODUCTION
Transform coding techniques have become an important paradigm in image and video coding standards, in which the Discrete Cosine Transform (DCT) [1][2] is applied because of its high decorrelation and energy compaction properties. In the past two decades, many contributions have focused on the Discrete Wavelet Transform (DWT) [3][4] for its performance in image coding. DCT and DWT, the two most popular techniques, are widely applied in image and video coding applications. The International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) and the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) have developed their own video coding standards, viz., Moving Picture Experts Group MPEG-1, MPEG-2 and MPEG-4 for multimedia, and H.261, H.263, H.263+, H.263++ and H.26L for videoconferencing applications. Recently, MPEG and the Video Coding Experts Group (VCEG) have jointly designed a new standard, H.264 / MPEG-4 (Part-10) [5], to provide better compression of video sequences. Researchers and experts from various institutions and research laboratories have contributed extensively over the past two decades to meet recent technology requirements in the video coding standards.
In Advanced Video Coding (AVC) [6], video is captured as a sequence of frames. Each frame is compressed by partitioning it into one or more slices, where each slice consists of a sequence of macro blocks. These macro blocks are transformed, quantized and encoded. The transformation module converts the frame data from the spatial domain to the frequency domain, which is intended to decorrelate the energy (i.e., the amount of information) present in the spatial domain. It also compacts the energy of the frame into a small number of transform coefficients, which can be encoded more efficiently than the original frame data. Since the transformation is reversible in theory, it does not change the information content of the source input signal during the encoding and decoding process. As per the Human Visual System (HVS), human eyes are more sensitive to low frequency signals than to high frequency signals. The main objective of this paper is to develop a hybrid technique that achieves better reconstruction quality than the existing technique used in the current advanced video coding standard. In this paper, a combination of orthogonal and bi-orthogonal wavelet filters has been applied at
Ubiquitous Computing and Communication Journal
different decomposition levels for intra frames and DCT for inter frames of the video encoder. Even though only the intra frames are coded with the wavelet transform, the impact can also be seen in inter-frame coding: because better quality anchor pictures are retained in frame memory for prediction, the remaining inter-frame pictures are coded more efficiently with DCT. The proposed transformation method is also implemented in the H.264/AVC reference software [7].

The paper is organized as follows. In Section 2, the basics of the transform coding methods are highlighted. The proposed hybrid transformation technique is described in Section 3. Extensive experimental results and discussion are given in Section 4, followed by the conclusion in Section 5.
2 BASICS OF TRANSFORM CODING

For any inter-frame video coding standard, the basic functional modules are motion estimation and compensation, transformation, quantization and entropy encoding. As shown in Fig. 1, the temporal redundancy that exists between successive frames is reduced by the motion estimation and compensation module. The residue, i.e. the difference between the original and the motion compensated frame, is passed through the transformation and quantization modules in sequence. The spatial redundancy that exists between neighboring pixels in an image or intra frame is reduced by these modules.

Figure 1: Basic Video encoding module.

The transformation module converts the residue symbols from the spatial domain into the frequency domain, which is intended to decorrelate the energy present in the spatial domain. This makes the data well suited to quantization. Quantized transform coefficients and the motion displacement vectors obtained from the motion estimation and compensation module are passed to the entropy encoding (Variable Length Coding) module, which removes the statistical redundancy. These modules are briefly introduced as follows.

2.1 Basics of Transformation

From the basic concepts of information theory, coding of symbols in vectors is more efficient than coding of symbols in scalars [8]. Exploiting this property, blocks of consecutive symbols from the source video input are treated as vectors. There is high correlation between neighboring pixels in an image or intra frame of video. Transformation is, in theory, a reversible model [9] which decorrelates the symbols in the given blocks. In recent image and video coding standards, the following transformation techniques are applied because of their orthonormal property and energy compactness.

2.1.1 Discrete Cosine Transform

The Discrete Cosine Transform is a widely used transform coding technique in image and video compression algorithms. It is able to decorrelate the input signal in a data-independent manner. When an image or a frame is transformed by DCT, it is first divided into blocks, typically of size 8x8 pixels. These blocks are transformed separately, without any influence from the surrounding blocks. The top left coefficient in each block is called the DC coefficient, and is proportional to the average value of the block. The rightmost coefficients in the block are the ones with the highest horizontal frequency, while the coefficients at the bottom have the highest vertical frequency. This implies that the coefficient in the bottom right corner has the highest frequencies of all the coefficients. The forward DCT of an original image f(i,j) for an (MxN) block size and the inverse DCT (IDCT) giving the reconstructed image \tilde{f}(i,j) for the same (MxN) block size are defined as

\[ F(u,v) = \frac{2\,C(u)\,C(v)}{\sqrt{MN}} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \cos\frac{(2i+1)u\pi}{2M} \cos\frac{(2j+1)v\pi}{2N}\, f(i,j) \tag{1} \]

\[ \tilde{f}(i,j) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \frac{2\,C(u)\,C(v)}{\sqrt{MN}} \cos\frac{(2i+1)u\pi}{2M} \cos\frac{(2j+1)v\pi}{2N}\, F(u,v) \tag{2} \]

where i, u = 0, 1, …, M-1 and j, v = 0, 1, …, N-1, and the constants C(u) and C(v) are obtained by

\[ C(x) = \begin{cases} \dfrac{\sqrt{2}}{2} & \text{if } x = 0 \\ 1 & \text{otherwise} \end{cases} \]
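The forward and inverse transforms of Eqs. (1) and (2) can be sketched directly in code. The following is a minimal, unoptimized illustration and not the transform used in any codec: it assumes the orthonormal scaling 2/sqrt(MN) with C(0) = sqrt(2)/2, and the function names `c`, `dct2` and `idct2` are hypothetical.

```python
import math


def c(x):
    """Normalisation constant C(x): sqrt(2)/2 for x == 0, otherwise 1."""
    return math.sqrt(2) / 2 if x == 0 else 1.0


def dct2(f):
    """Forward 2-D DCT of an M x N block (list of rows), following Eq. (1)."""
    M, N = len(f), len(f[0])
    scale = 2.0 / math.sqrt(M * N)  # assumed orthonormal scaling
    F = [[0.0] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            s = 0.0
            for i in range(M):
                for j in range(N):
                    s += (math.cos((2 * i + 1) * u * math.pi / (2 * M))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                          * f[i][j])
            F[u][v] = scale * c(u) * c(v) * s
    return F


def idct2(F):
    """Inverse 2-D DCT of an M x N coefficient block, following Eq. (2)."""
    M, N = len(F), len(F[0])
    scale = 2.0 / math.sqrt(M * N)
    f = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            s = 0.0
            for u in range(M):
                for v in range(N):
                    s += (c(u) * c(v)
                          * math.cos((2 * i + 1) * u * math.pi / (2 * M))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                          * F[u][v])
            f[i][j] = scale * s
    return f
```

Under this scaling a constant 8x8 block puts all of its energy into the DC coefficient, and `idct2(dct2(block))` returns the block up to floating-point error. Practical encoders use fast, separable integer approximations rather than this quadruple loop.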
MPEG standards apply DCT for video compression. The compression exploits spatial and temporal redundancies which occur in video objects or frames. Spatial redundancy can be utilized by simply coding each frame separately. This technique is referred to as intra frame coding. Additional compression can be achieved by taking advantage of the fact that consecutive frames are often almost identical. This temporal compression has the
potential for a major reduction over simply encoding each frame separately, but the effect is lessened by the fact that video contains frequent scene changes. This technique is referred to as inter-frame coding. The DCT and motion compensated inter-frame prediction are combined. The coder subtracts the motion-compensated prediction from the source picture to form a ‘prediction error’ picture. The prediction error is transformed with the DCT, the coefficients are quantized using scalar quantization, and the quantized values are coded using arithmetic coding. The coded luminance and chrominance prediction error is combined with the ‘side information’ required by the decoder, such as motion vectors and synchronizing information, and formed into a bit stream for transmission. This technique works well with a stationary background and a moving foreground, since only the movement in the foreground is coded.

Despite all the advantages of the DCT-based JPEG and MPEG compression schemes, namely simplicity, satisfactory performance, and availability of special purpose hardware for implementation, they are not without shortcomings. Since the input image needs to be “blocked,” correlation across the block boundaries is not eliminated. The result is noticeable and annoying “blocking artifacts,” particularly at low bit rates.

2.1.2 Discrete Wavelet Transform

Wavelets are functions defined over a finite interval and having an average value of zero. The basic idea of the wavelet transform is to represent any arbitrary function as a superposition of a set of such wavelets or basis functions. These basis functions, or child wavelets, are obtained from a single prototype wavelet called the mother wavelet by dilations (scaling) and translations. Wavelets are used to characterize detail information. The averaging information is formally determined by a kind of dual to the mother wavelet, called the scaling function φ(t).
The main concept of wavelets is that at a particular level of resolution j, the set of translates indexed by n forms a basis at that level. The set of translates forming the basis at the next level j+1, a coarser level, can all be written as a sum of weights times the level-j basis. The scaling function is chosen such that the coefficients of its translates are all necessarily bounded. The scaling function, along with its translates, forms a basis at the coarser level j+1 but not at level j. Instead, at level j the set of translates of the scaling function φ(t) along with the set of translates of the mother wavelet ψ(t) do form a basis. Since the set of translates of the scaling function φ(t) at a coarser level can be written exactly as a weighted sum of translates at a finer level, the scaling function must satisfy the dilation equation

\[ \phi(t) = \sum_{n \in \mathbb{Z}} \sqrt{2}\, h_0[n]\, \phi(2t - n) \tag{3} \]
The dilation equation is a recipe for finding a function that can be built from a sum of copies of itself that are scaled, translated and dilated. Equation (3) expresses a condition that a function must satisfy to be a scaling function and at the same time forms a definition of the scaling vector h0. The wavelet at the coarser level is also expressed as

\[ \psi(t) = \sum_{n \in \mathbb{Z}} \sqrt{2}\, h_1[n]\, \phi(2t - n) \tag{4} \]

The discrete high-pass impulse response h1[n], describing the details using the wavelet function, can be derived from the discrete low-pass impulse response h0[n] using the following equation.

\[ h_1[n] = (-1)^n\, h_0[1 - n] \tag{5} \]

The number of coefficients in the impulse response is called the number of taps in the filter. For orthogonal filters, the forward transform and its inverse are transposes of each other, and the analysis filters are identical to the synthesis filters.
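As an illustration of Eq. (5) and of the orthogonality property, the sketch below derives the high-pass filter from a two-tap low-pass filter and applies one level of analysis and synthesis to a 1-D signal. It assumes the Haar low-pass filter h0 = [1/sqrt(2), 1/sqrt(2)] and an even-length input; the function names are hypothetical and the code is not taken from any codec.

```python
import math

# Assumed two-tap Haar low-pass filter, indexed n = 0, 1.
H0 = [1 / math.sqrt(2), 1 / math.sqrt(2)]


def highpass_from_lowpass(h0):
    """Apply h1[n] = (-1)^n * h0[1 - n] for a two-tap filter (n = 0, 1)."""
    return [((-1) ** n) * h0[1 - n] for n in range(2)]


def analyze(x, h0):
    """One decomposition level: returns (approximation, detail) coefficients."""
    h1 = highpass_from_lowpass(h0)
    approx, detail = [], []
    for k in range(0, len(x), 2):
        approx.append(h0[0] * x[k] + h0[1] * x[k + 1])
        detail.append(h1[0] * x[k] + h1[1] * x[k + 1])
    return approx, detail


def synthesize(approx, detail, h0):
    """Inverse of analyze(): for an orthogonal filter the same taps are reused."""
    h1 = highpass_from_lowpass(h0)
    x = []
    for a, d in zip(approx, detail):
        x.append(h0[0] * a + h1[0] * d)
        x.append(h0[1] * a + h1[1] * d)
    return x
```

For example, `synthesize(*analyze([4, 6, 10, 12], H0), H0)` recovers the input up to rounding, and the sum of squared coefficients equals the sum of squared samples, which is the energy preservation expected of an orthogonal filter. Applying `analyze` again to the approximation coefficients gives the second decomposition level.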
2.2 Quantization

A quantizer [10][11] reduces the number of bits needed to store the transformed coefficients by reducing the precision of those values. Since this is a many-to-one mapping, it is a lossy process and is the main source of compression in an encoder. Quantization can be performed on each individual coefficient, which is referred to as scalar quantization. Quantization can also be performed on a group of coefficients together, which is referred to as vector quantization. Uniform quantization is a process of partitioning the domain of input values into equally spaced intervals, except for the outer intervals. The end points of the partition intervals are called the quantizer decision boundaries. The output or reconstruction value corresponding to each interval is taken to be the midpoint of the interval. The length of each interval is referred to as the step size (fixed in the case of uniform quantization), denoted by the symbol ∆. The step size ∆ is given by

\[ \Delta = \frac{2 X_{\max}}{M} \tag{6} \]

where M is the number of levels of the quantizer and X_max is the maximum magnitude of the input symbols, so that the inputs span [-X_max, X_max]. In this work, the quantizer used in H.264 has been considered for inter-frame motion compensated predictive coding, which allows an acceptable loss in quality for the given video sequences.
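The uniform quantizer described above can be sketched as follows. This is a generic midpoint-reconstruction quantizer under the assumption that inputs lie nominally in [-X_max, X_max]; it is not the H.264 quantizer used in the experiments, and the function names are hypothetical.

```python
import math


def step_size(x_max, levels):
    """Step size from Eq. (6): delta = 2 * X_max / M."""
    return 2.0 * x_max / levels


def quantize(x, x_max, levels):
    """Map x to an interval index in 0 .. levels - 1.

    Values beyond the nominal range fall into the outer intervals,
    which is implemented here by clamping the index.
    """
    delta = step_size(x_max, levels)
    index = math.floor((x + x_max) / delta)
    return min(max(index, 0), levels - 1)


def dequantize(index, x_max, levels):
    """Reconstruct the midpoint of the interval with the given index."""
    delta = step_size(x_max, levels)
    return -x_max + (index + 0.5) * delta
```

With X_max = 8 and M = 8 the step size is 2, so an input of 3.3 falls in the interval [2, 4) and is reconstructed as its midpoint 3.0. The difference between input and reconstruction is the quantization error, which is bounded by ∆/2 inside the nominal range.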
2.3 Motion Estimation

Motion estimation (ME) [12] is a process to estimate the pixels of the current frame from reference frame(s). Block matching motion estimation, or the block matching algorithm (BMA), which is a temporal redundancy removal technique
between two or more successive frames, is an integral part of most motion compensated video coding standards. Frames are divided into regular sized blocks, referred to as macro blocks (MB). The block-matching method finds the best-matched block in the previous frame. Based on a block distortion measure (BDM), the displacement of the best-matched block is described as the motion vector (MV) of the block in the current frame. The best match is usually evaluated by a cost function based on a BDM such as the mean absolute difference (MAD), defined as

\[ \mathrm{MAD}(i,j) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \left| c(x+k,\, y+l) - p(x+k+i,\, y+l+j) \right| \tag{7} \]

where M x N is the size of the macro block, c(.,.) and p(.,.) denote the pixel intensity in the current frame and the previously processed frame respectively, (x,y) is the coordinate of the upper left corner of the current block, (k,l) indexes the pixels within the block, and (i,j) represents the displacement in pixels relative to the position of the current block. After checking each location in the search area, the motion vector is determined as the (i,j) at which the MAD has the minimum value. In this work, an exhaustive full search has been applied for the motion compensated prediction technique.
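An exhaustive full search over the MAD of Eq. (7) can be sketched as below. Here (x, y) is taken as the upper-left corner of the current block and (i, j) as the candidate displacement, matching the indices inside the equation; frames are plain lists of rows, the first index is assumed to be the row, and the search range `p` and all function names are assumptions made for illustration.

```python
def mad(cur, ref, x, y, i, j, M, N):
    """Mean absolute difference of Eq. (7) for displacement (i, j)."""
    total = 0
    for k in range(M):
        for l in range(N):
            total += abs(cur[x + k][y + l] - ref[x + k + i][y + l + j])
    return total / (M * N)


def full_search(cur, ref, x, y, M, N, p):
    """Test every displacement in [-p, p] x [-p, p] that keeps the
    candidate block inside the reference frame.

    Returns (motion_vector, minimum_mad).
    """
    rows, cols = len(ref), len(ref[0])
    best_cost, best_mv = None, None
    for i in range(-p, p + 1):
        for j in range(-p, p + 1):
            inside = (0 <= x + i and x + i + M <= rows
                      and 0 <= y + j and y + j + N <= cols)
            if not inside:
                continue
            cost = mad(cur, ref, x, y, i, j, M, N)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (i, j)
    return best_mv, best_cost
```

The cost of this search grows with (2p + 1)^2 candidate positions per block, which is why fast search patterns are often used in practice; the full search is nevertheless guaranteed to find the displacement with minimum MAD within the window.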
2.4 Entropy Encoding

Following Claude E. Shannon [8], the entropy η of an information source with alphabet S = {s1, s2, …, sn} is defined as

\[ \eta = H(S) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} \tag{8} \]

where pi is the probability of symbol si in S. The term log2(1/pi) indicates the amount of information contained in si, which corresponds to the number of bits needed to encode si. An entropy encoder further compresses the quantized values to give a better compression ratio. It uses a model to determine the probabilities of each quantized value and produces an appropriate code based on these probabilities, so that the resultant output code stream is smaller than the input stream. The most commonly used entropy encoders are the Huffman encoder [13] and the arithmetic encoder [14]. It is important to note that a properly designed entropy encoder is necessary, along with optimal signal transformation, to get the best possible compression.

Arithmetic coding is a more modern coding method that usually outperforms Huffman coding in practice. In arithmetic coding, a message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify that interval grows. Successive symbols of the message reduce the size of the interval in accordance with the symbol probabilities generated by the model. Arithmetic coding is more complex to implement than Huffman coding. CAVLC, as used in H.264, has been considered in the experiments for the entropy encoding process.
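The entropy of Eq. (8) can be estimated from a sequence of quantized symbols by using relative frequencies as the probabilities pi. The sketch below does exactly that; the function name `entropy` is hypothetical, and the estimate is only a lower bound on the average code length for a memoryless model of the source.

```python
import math
from collections import Counter


def entropy(symbols):
    """Estimate H(S) in bits per symbol from observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    # p_i = count / total, so log2(1 / p_i) = log2(total / count)
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

A source with four equally likely symbols gives 2 bits per symbol, while a sequence consisting of a single repeated symbol gives 0, which is why a transform that concentrates the coefficients on few values helps the entropy encoder.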
2.5 Motivation for this work

DCT is the best transformation technique for motion estimation and compensated predictive coding models. Because of the blocking artifacts encountered with DCT, sub band coding methods are considered as an alternative. DWT is the best alternative method because of its energy compaction and preservation properties. Because DWT in turn incurs ringing artifacts, researchers and experts from various institutes and research labs have contributed extensively to this problem over the past two decades. In addition to the transformation module, in the DCT-based Motion Compensated Predictive (MCP) [15] coding architecture, previously processed frames are considered as reference frames to predict the future frames. Even though the transformation module is energy preserving and lossless in theory, it is not perfectly reversible in practice. Subsequently, the transformed coefficients are quantized to achieve higher compression, which leads to further loss in the frames that are stored in frame memory as reference frames for future frame prediction. Decoded frames are used for the prediction of new frames as per the MCP coding technique. JPEG 2000 [16] proved that high quality image compression can be achieved by applying DWT. This motivates us to use a combination of orthogonal and bi-orthogonal wavelet filters at different levels of decomposition for intra frames and DCT for inter frames of a video sequence.

3 PROPOSED HYBRID TRANSFORMATION WITH DIFFERENT COMBINATION OF WAVELET FILTERS
In order to improve the efficiency of the transformation phase, the following techniques are adopted in the transformation module of the CODEC. The orthogonal Haar filter and the bi-orthogonal Daubechies 9/7 filter are considered for intra frames, and the Discrete Cosine Transform for inter frames of the video sequence. Figure 2 illustrates an overview of the H.264/AVC encoder with the hybrid transformation technique. Previously processed frames (F’n-1) are used to perform Motion Estimation and Motion Compensated Prediction, which yields motion vectors. These motion vectors are used to make a motion compensated frame. In the case of inter frames, the motion compensated frame is subtracted from the current frame (Fn) and the residual frame is transformed using the Discrete Cosine Transform (T) and quantized (Q). In the case of intra frames, the current frame is transformed using
Discrete Wavelet Transform (DWT) with different wavelet filters such as Haar and Daubechies and quantized (Q). The quantized transform coefficients are then entropy coded and transmitted or stored through the NAL along with the motion vectors found in the motion estimation process.

[Figure 2 block diagram; labelled blocks: Fn, ME, MC, F’n-1, Inter, Intra, Choose intra prediction, Intra prediction, T, DWT, Q, X, Reorder, Entropy encoder, NAL, Q^-1, IT, IDWT, Filter, F’n]

Figure 2: Encoder in the hybrid transformation with wavelet filters.

For predicting the subsequent frames from the anchor intra frames, the quantized transform coefficients are dequantized (Q^-1), inversely transformed (IT) and retained in the frame store for motion compensated prediction.

In the case of intra frames, the inverse Discrete Wavelet Transform (IDWT) is applied in order to obtain reconstructed reference frames (F’n), through the deblocking filter, for the inter frames of the video sequence. The hybrid transformation technique thus employs different techniques for different categories of frames. Intra frames are coded using both Haar wavelet filter coefficients [0.707, 0.707] and bi-orthogonal Daubechies 9/7 wavelet filter coefficients, as shown in Table 1 [16], in different combinations on different decomposition levels. Because of the wavelet’s advantages over DCT, such as complete spatial correlation among pixels in the whole frame and avoidance of undesirable blocking artifacts, the intra frame is reconstructed with high quality. The first frame in a GOF is intra frame coded. Frequent intra frames enable random access to the coded stream. Inter frames are predicted from previously decoded intra frames.

Table 1: Biorthogonal wavelets filter coefficients.

Analysis Filter Coefficients
i     Lowpass Filter gL(i)    Highpass Filter gH(i)
0     0.602949018236359       1.115087052456994
±1    0.266864118442872       -
±2    -                       -
±3    -                       0.091271763114249
±4    0.026748757410809       -

Synthesis Filter Coefficients
i     Lowpass Filter hL(i)    Highpass Filter hH(i)
0     1.115087052456994       0.602949018236379
±1    0.591271763114247       -
±2    -                       -
±3    -                       0.016864118442874
±4    -                       0.026748757410809

4 EXPERIMENTAL RESULTS AND DISCUSSION

The experiments were conducted on three CIF video sequences, “Bus” (352x288, 149 frames), “Stefan” (352x288, 89 frames) and “Flower Garden” (352x288, 299 frames), and two QCIF video sequences, “Suzie” (176x144, 149 frames) and “Mobile” (176x144, 299 frames). The experimental results show that the developed hybrid transform coding with the wavelet filter combination outperforms conventional DCT based video coding in terms of quality. Peak Signal to Noise Ratio (PSNR) is commonly used to measure the quality. It is expressed on a logarithmic scale and relates the Mean Squared Error (MSE) between the original and reconstructed image or video frame to the highest available symbol value in the spatial domain:

\[ \mathrm{PSNR} = 10 \log_{10} \frac{(2^n - 1)^2}{\mathrm{MSE}} \ \text{dB} \tag{9} \]

where n is the number of bits per image symbol. The fundamental tradeoff is between bit rate and fidelity [17]. The goal of any source encoding system is to make this tradeoff acceptable while keeping moderate coding efficiency.
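The quality measure used in the experiments, PSNR computed from the mean squared error for n-bit samples as 10 log10((2^n - 1)^2 / MSE), can be sketched as below. Frames are assumed to be equally sized lists of rows holding one colour component (Y, U or V); the function names are hypothetical and the code is not taken from the reference software.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized frames."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(original, reconstructed, n=8):
    """PSNR in dB for n-bit samples; infinite for identical frames."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    peak = (2 ** n - 1) ** 2
    return 10.0 * math.log10(peak / err)
```

A sequence-level figure such as an average Y-PSNR would then be obtained by averaging `psnr` over the luminance component of all frames. Because the scale is logarithmic, a gain of a fraction of a dB corresponds to a few percent reduction in MSE.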
Table 2: Proposed combination of wavelets filters.

Proposed combination    1st level Decomposition    2nd level Decomposition
P1                      Haar                       Haar
P2                      Haar                       Daub
P3                      Daub                       Haar
P4                      Daub                       Daub

Table 2 shows the combinations of the orthogonal Haar and bi-orthogonal Daubechies 9/7 wavelet filters at different levels of decomposition in transform coding. These combinations are simulated in the H.264/AVC codec, where the DCT is the de facto transformation technique for both intra frames and inter frames of video sequence processing. Table 3 compares the quality, in terms of Peak Signal-to-Noise Ratio (PSNR), of the existing de facto DCT transformation with that of the proposed combinations of wavelet filters. The values in the table represent the average PSNR for the Luminance (Y)
component and Chrominance (U and V) components. As per the Human Visual System, human eyes are more sensitive to luminance than to the chrominance components. In this analysis, both luminance and chrominance components are considered because of the importance of colour in near lossless applications. There is a 0.12 dB Y-PSNR improvement in the P4 combination over the DCT transformation for the ‘Bus’ CIF sequence. For the ‘Stefan’ CIF sequence, a 0.31 dB Y-PSNR improvement has been achieved in the P1 combination with the existing transformation. A 0.14 dB Y-PSNR improvement over the DCT transformation has been obtained in the P4 combination for the ‘Flower-Garden’ CIF sequence.

Table 3: PSNR comparison for the various video sequences.

Sequence        PSNR   Existing (dB)   P1 (dB)   P2 (dB)   P3 (dB)   P4 (dB)
Bus             Y      35.77           35.03     35.88     35.88     35.89
                U      35.83           35.81     35.83     35.82     35.82
                V      36.04           36.03     36.04     36.03     36.03
Stefan          Y      36.38           35.69     36.50     36.50     36.50
                U      35.00           35.00     35.01     35.00     35.00
                V      36.90           36.90     36.91     36.91     36.91
Flower Garden   Y      36.00           35.72     36.13     36.13     36.14
                U      36.51           36.49     36.47     36.50     36.50
                V      34.93           34.92     34.93     34.94     34.93
Suzie           Y      37.62           37.57     37.66     37.68     37.68
                U      43.76           43.71     43.72     43.75     43.74
                V      43.32           43.35     43.43     43.39     43.39
Mobile          Y      33.95           33.92     34.10     34.10     34.10
                U      35.13           35.12     35.10     35.08     35.08
                V      34.92           34.96     34.91     34.91     34.91

As far as the QCIF sequences ‘Suzie’ and ‘Mobile’ are concerned, up to 0.15 dB Y-PSNR improvement has been achieved when the bi-orthogonal wavelet filters are considered in the 2nd level of decomposition of the wavelet operation for intra frames of the video sequences. In both CIF and QCIF video sequences, comparable quality has been attained as far as the chrominance components (U-PSNR and V-PSNR) are concerned.

5 CONCLUSION

In this paper, a hybrid transformation technique for advanced video coding has been proposed, in which the intra frames of a video sequence are coded by DWT with Haar and Daubechies wavelet filters and the inter frames are coded with DCT. The hybrid transformation technique is also simulated in the existing H.264/AVC reference software. Experiments were conducted with various standard CIF and QCIF video sequences such as Bus, Stefan, Flower-Garden, Mobile and Suzie. The performance parameter considered in this paper is PSNR. The performance evaluations show that the hybrid transformation technique outperforms the existing DCT transformation method used in H.264/AVC significantly. The experimental results also demonstrate that the combination of the Haar wavelet filter in the 1st level of decomposition and Daubechies wavelet filters in the 2nd level of decomposition outperforms the other combinations and the original DCT used in the existing AVC standard.

ACKNOWLEDGEMENT

The authors wish to thank the undergraduate students S. Anusha, A. R. Srividhya, S. Vanitha, V. Rajalakshmi, R. Ramya, M. Vishnupriya, A. Arun, V. Vijay Anand, S. Dhinesh Kumar and P. Navaneetha Krishnan for their valuable help.

6 REFERENCES
[1] Zixiang Xiong, Kannan Ramchandran, Michael T. Orchard and Ya-Qin Zhang: A Comparative Study of DCT- and Wavelet-Based Image Coding, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 5, pp. 692-695 (1999).
[2] N. Ahmed, T. Natarajan and K. R. Rao: Discrete Cosine Transform, IEEE Transactions on Computers, pp. 90-93 (1974).
[3] Ingrid Daubechies: Ten Lectures on Wavelets, Capital City Press, Pennsylvania, pp. 53-105 (1992).
[4] Marc Antonini, Michel Barlaud, Pierre Mathieu and Ingrid Daubechies: Image Coding Using Wavelet Transform, IEEE Transactions on Image Processing, Vol. 1, No. 2, pp. 205-220 (1992).
[5] Gary J. Sullivan, Pankaj Topiwala and Ajay Luthra: The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions, SPIE Conference on Applications of Digital Image Processing XXVII (2004).
[6] Iain E. G. Richardson: H.264 and MPEG-4 Video Compression, John Wiley & Sons (2003).
[7] ftp://ftp.imtc.org/jvt-experts/reference_software.
[8] C. E. Shannon: A Mathematical Theory of Communication, Bell System Technical Journal, Vol. 27, pp. 623-656 (1948).
[9] Keith Jack: Video Demystified, Penram International Publishing Pvt. Ltd., Mumbai, pp. 234-236 (2001).
[10] Allen Gersho: Quantization, IEEE Communications Society Magazine, pp. 16-29 (1977).
[11] Peng H. Ang, Peter A. Ruetz and David Auld: Video Compression Makes Big Gains, IEEE Spectrum (1991).
[12] Frederic Dufaux, Fabrice Moscheni: Motion Estimation Techniques for Digital TV: A Review and a New Contribution, Proceedings of the IEEE, Vol. 83, No. 6, pp. 858-876 (1995).
[13] D. A. Huffman: A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the IRE, Vol. 40, No. 9, pp. 1098-1101 (1952).
[14] P. G. Howard, J. C. Vitter: Arithmetic Coding for Data Compression, Proceedings of the IEEE, Vol. 82, No. 6, pp. 857-865 (1994).
[15] K. R. Rao, J. J. Hwang: Techniques and Standards for Image, Video and Audio Coding, NJ, Prentice Hall, pp. 85-96 (1996).
[16] B. E. Usevitch: A Tutorial on Modern Lossy Wavelet Image Compression: Foundations of JPEG 2000, IEEE Signal Processing Magazine, Vol. 18, No. 5, pp. 22-35 (2001).
[17] Gary J. Sullivan, Thomas Wiegand: Video Compression – from Concepts to the H.264/AVC Standard, Proceedings of the IEEE, Vol. 93, No. 1, pp. 18-31 (2005).