Boosting the Performance of LZW Compression through Block Sorting for Universal Lossless Data Compression Sajib Kumar Saha, Md. Masudur Rahaman† Khulna University/CSE, Khulna, Bangladesh, e-mail:
[email protected] /
[email protected] † Khulna University/CSE, Khulna, Bangladesh, e-mail:
[email protected] Abstract— In this paper a new data compression technique has been proposed based on Lampel Ziv Welch (LZW) coding and block sorting. At first block sorting is performed on input data to produce permuted data. LZW coding is then applied on that permuted data to produce more compressed output. Though block sorting takes some extra time (the amount is very negligible), it increases the performance of LWZ compression much. The proposed model is compared with respect to LZW compression. Keywords—Block sorting, cyclic rotation, permuted string, reverse block sorting, LZW.
I. INTRODUCTION Data compression algorithms are developed to reduce redundancy in data representation in order to decrease data storage requirement. It is broadly divided into 2 categories; lossless and lossy. LZW compression is very much popular in the field of lossless data compression. In that paper a modification of LZW compression has been devised through block sorting [3], which is a concept for producing permuted data. In the organization of this paper we first describe block sorting, and then LZW coding technique. The proposed model and experimental results comes in succession and proves the superiority of the proposed technique over LZW compression. II. LITERATURE SURVAY A. Basic Concepts about Block Sorting Block sorting [3] is a technique for making text files more amenable to compression by permuting the input file so that characters that appear in similar contexts are clustered together in the output. Figure 1 shows how the block sorting would be performed on the shorter text mississippi. First, all the cyclic rotations of the text are produced. Next, the characters are sorted into the order of their contexts. Finally, the permuted characters are transmitted. For example, the complete permuted text for the string mississippi is pssmipissii (Figure 1(c)). It is also necessary to transmit the position in the permuted text of the first character of the original text. In the case of the example in Figure 1, we transmit the number four, since the letter m, which is the first character of the input, appears at position four in the permuted string.
Therefore, the output of the block sorting for the string mississippi is the pair {pssmipissii, 4}. Remarkably, it is possible, given only this output, to reconstruct the original text. This is possible because the matrix in Figure 1 is constructed using cyclic rotations; that is, the characters in the last column of Figure 1(b) cyclically precede those in the first column. Figure 1(d) shows this relationship more clearly: the characters in the left-hand column (which are taken from the last column of the sorted matrix) immediately precede those in the righthand column (the first character of the sorted matrix). If we refer to the first and last columns of the sorted matrix as F and L respectively, then L is the output from the block sorting, and F can be obtained by the decoder simply by sorting the characters of L. Then we see that each character in L is followed by the corresponding character in F. In particular, we know that in this case L[4] is the first character of the text; therefore, F[4] (the letter i) must be the second character. We must now consider the problem of locating the next character to decode. We know that each character in F must correspond to a character in L, since the two strings are merely permutations of one another. However, the letter i occurs four times in L. Which one corresponds to the, i that has just been decoded (the fourth i in F)? To answer this question, we must observe that each group of characters in F occurs in the same order in L. For example, the third s in F corresponds to the third s in L, and so on. Since we are dealing in this instance with the fourth i in F, this corresponds to the fourth i in L (L[11]), which, we then discover, is followed by an s, and so forth. Decoding the rest of the string in a similar manner gives the order 4, 11, 9, 3, 10, 8, 2, 7, 6, 1, 5.
F 1 2 3 4 5 6 7 8 9 10 11
mississippi ississippim ssissippimi sissippimis issippimiss ssippimissi sippimissis ippimississ ppimississi pimississip imississipp
i mississip p i ppimissis s i ssippimis s i ssissippi m m ississipp i p imississi p p pimississ i s ippimissi s s issippimi s s sippimiss i s sissippim i (b) L F
(a) 1 2 3 4 5 6 7 8 9 10 11
1 2 3 4 5 6 7 8 9 10 11
L
p s s m* i p i s s i i
1 2 3 4 5 6 7 8 9 10 11
p i s i s i m* i i m p p i p s s s s i s i s
(d) (c) Figure 1: Block sorting of the string mississippi: (a) rotations of the string; (b) sorted matrix; (c) permuted string (last character of sorted matrix); (d) permuted string and sorted string. B. Basic Concepts about LZW Compression In LZW compression algorithm [1, 2, 4], the input file that is to be compressed is read character by character and they are combined to form a string. The process continues till it reaches the end of file. Every new string is assigned some code and stored in Code table. They can be referred when the string is repeated with that code. The codes are assigned from 256, since in ASCII character set we have already 256(0-255) characters. The decompression algorithm expands the compressed file. Here the file, which is created in the compression, is read character by character and it is expanded. This decompression process doesn’t require the Code table built during the compression. Here the 1st and the 2nd characters are combined to form a string and they are stored in the Code table. The code 256(100h) is assigned to the first new string. Then 2nd and 3rd characters are combined and if that string is not available in the Code table, it is assigned a new code and it is stored in the Code table. Thus we are building a Code table with every new string. When the same string is read again, the code already stored in the table will be used. Thus compression occurs when a single code is outputted instead of a set of characters [1, 5]. The extended ASCII holds only 256(0 to 255) characters [4, 5]and it requires just 8-bits to store each character. But for building the Code table, we have to
extend the 8-bits to few more bits to hold 256(100h) and above. If we extend it to 12-bits, then we can store up to 4096 elements in the table. So when we store each element in the table it is to be converted to a 12-bit number. For example, when you want to store A(dec-65, hex 41), T(dec-84, hex-54), O(dec-79, hex-4F) and Z(dec-90, hex-5A), you have to store it in bytes as 04, 10, 54, 04, F0, 5A . The reason is, we have allotted only 12-bits for each character. Consider a string .ATOZOFC.. It takes 7x8(56) bits. Suppose if a code is assigned to it as 400(190h), it will take only 12-bits instead of 56-bits! Example Input string: ATOZOFCATOZOFCATOZOFC Characters Read
String Stored / Retrieved
Process in Table
A T O Z O F C A T
AT TO OZ ZO OF FC CA AT
Store Store Store Store Store Store Store Retrieve
O Z
AT0 OZ
Store Retrieve
O F
OZO OF
Store Retrieve
C A
OFC CA
Store Retrieve
T O
CAT TO
Store Retrieve
Z O
TOZ ZO
Store Retrieve
F C
ZOF FC
Store Retrieve
In file
Store Store Store Store Store Store Store Store Relevant Code Store Relevant Code Store Relevant Code Store Relevant Code Store Relevant Code Store Relevant Code Store Relevant Code
In this example-string, the first character A is read and then the second character T. Both the characters are concatenated as AT and a code is assigned to it. The code is stored in the Code table. Since this is the first string that
is new to the table, it is assigned 256(100h). Then the second and the third characters are concatenated to form another new string TO. This string is also new to the Code table and the table expands to accommodate this new string and it is assigned the next code 257(101h). Thus whenever a new string is read after concatenation it is assigned a relevant code and the Code table is build. The table expands till the code reaches 4096 (since we have assigned 12-bits) or it reaches the end of file. When the same set of characters that is stored in the table is again read it is assigned to the code in the Code table. Thus according to the number of bits specified by the program the output code is stored. In other words, if we have extended the bits from 8 to 12 then the character that is stored in 8-bits should be adjusted so as to store it in 12-bit format. C. Basic Concepts about LZW Decompression The file that is compressed is read byte by byte [1, 2]. The bytes are concatenated according to the number of bits specified by us. For example, we have used 12-bits for storing the elements so we have to read first 2-bytes and get the first 12-bits from those 16-bits. Using this bits Code table is build again without the Code table previously created during the compression. Use the remaining 4-bits from the previous 2-bytes and next byte to form the next code in the string table. Thus we can build the Code table and use it for decompression. This decompression algorithm builds its own Code table and it will be same as the table created during the compression. The decompression algorithm refers this newly created Code table but not the Code table created during the compression. This is the main advantage in this algorithm. Example Consider the same example given above and do the decompression.
Here each byte is read one by one as hexadecimal code and 3 of the bytes are combined so as to convert them from a 12-bit format to a 8-bit character (ASCII) format. Thus the bytes 04, 10 & 84 are combined as 041084. The combined code is split to get A(041) and T(084). The table
is also built concurrently when each new string is read. When we read 100, 102 etc., we can refer to the relevant code in the table and output the relevant code to the file. For example, when we reach the 4th set of characters and read 04, 31 and 00 they must be converted to 12-bit form as 043 and 100 will refer to the code in the table and outputs the string C and AT respectively. Thus we can get all the characters without knowing the previous Code table. III. A NEW MODEL FOR COMPRESSING DATA LZW works on input data to produce compressed output. But in that model LZW works on block sorted data to produce a more compressed output. The proposed model is given in Figure 2. Input data
Compressed data
Block sorting
LZW decompression
LZW compression
Reverse bock sorting
Compressed data (a)
Source data (b)
Figure 2: Proposed model (a) Compression (b) Decompression. A. Proposed Compression Technique The proposed compression algorithm consists of two phases: 1. Block sorting. 2. LWZ Coding. Block Sorting Block sorting is performed by adopting the following steps as proposed in [3]: i. Write the entire input as the first row of a matrix, one symbol per column. ii. Form all cyclic permutations of that row and write all the permutation as the other rows of the matrix. iii. Sort the matrix rows according to the lexicographical order of the elements of the rows. iv. Take as output the final column of the sorted matrix, together with the number of the row which corresponds to the original input.
LZW Coding [5] i. Specify the number of bits to which you have to extend ii. Read the first character from the file and store it in ch iii. Repeat steps (4) to (7) till there is no character in the file iv. Read the next character and store it in ch2 v. If ch+ch2 is in the table get the code from the table otherwise output the code for ch+ch2 add to the table vi. Store it to the Output file in the specified number of bits vii. ch = ch2 viii. Output the last character ch ix. Exit B. Proposed Decompression Technique The proposed decompression algorithm consists of two phases: 1. LZW Decoding. 2. Reverse block sorting LZW Decoding [5] i. Read the character l ii. Convert l to its original form iii. Output l iv. Repeat steps(5) to (10) till there is no character in the file v. Read a character z vi. Convert l+z to its original form vii. Output in character form viii. If l+z is new then Store in the code table ix. Add l+z first char of entry to the code table x. l = first char of entry xi. Exit
IV. EXPERIMENTAL RESULTS We have taken files from the CORPUS for experiment. The experimental result is given on Table I and Table II. Table I Comparison with LZW compression and the proposed model for files from CORPUS Using the Using File Name Original Proposed LZW Size Model Compression Bib book1 book2 News paper1 paper2 Alice29.txt plrabn12.txt Chntxx.seq Chmpxx.seq Mj Sc
(bytes) 1,11,261 7,68,771 6,10,856 3,77,109 53,161 82,199 1,55,648 4,81,861 1,55,844 11,67,360 4,48,779 29,00,352
(bytes) 46,864 3,54,912 3,04,800 1,87,456 19,980 32,567 59,020 2,82,049 63,625 3,92,905 2,45,927 13,63,336
(bytes) 36,804 2,94,991 2,34,050 1,47,356 17,198 27,467 51,102 1,82,104 39,125 2,92,303 1,45,967 11,63,361
Table II Comparison with LZW and the proposed model for compression and decompression time Compress time Decompress time File Name Original Size Using LZW Using the Using LZW Using the Proposed Proposed Model Model (bytes) (ms) (ms) (ms) (ms) Bib 51, 202 885 363 874 351 book1 39, 821 554 201 547 194 book2 8, 714 102 35 102 35 News 65, 178 775 309 769 263 paper1 1,37, 566 1954 688 1912 671 paper2 1,64, 597 2652 1063 2550 994 Alice29.txt 1,55,648 2420 1005 2344 974 plrabn12.txt 4,81,861 4680 1105 4598 1099 Chntxx.seq 1,55,844 2199 823 2176 797 Chmpxx.seq 11,67,360 20102 7101 19231 7067 Mj 4,48,779 7764 2522 7681 2501 Sc 29,00,352 35304 10211 28678 9952 bible.txt 40,47,392 37003 12995 31453 12601 E.Coli 46,38,690 38906 13403 33211 13210 Total 159,99,283 155300 51824 136126 49735
V. CONCLUSION The proposed model modifies the LZW compression
[2]
Jacob Ziv and Abraham Lempel, Compression of Individual Sequences Via Variable-Rate Coding, IEEE Transactions on Information Theory , September 1978.
[3]
Burrows, M. and Wheeler, D.J (1994) A block sorting Lossless Data Compression Algorithm, SRC Research Report 124, Digital System Research center, Palo Alto,CA,gatekeeper.doc.com, /pub/DEC/SRC/research-reports/SRC124.ps.Z. http://marknelson.us/1989/10/01/lzw-datacompression http://en.wikipedia.org/wiki/LZW#References
[1, 2] by performing block sorting [3] on inputted data first; and then performs all the operations similar as [2]. The results have shown that the method achieves better compression ratio but an increase in compression and decompression time. Here the increase of compression and decompression time is very negligible considering the increase of compression ratio.
REFERENCES [1]
Welch, T. A. , A technique for highperformance data compression, Computer. Vol. 17, pp. 8-19, June 1984.
[4] [5]