
TEL-AVIV UNIVERSITY The Iby and Aladar Fleischman Faculty of Engineering

Class Coding and Two Applications

A thesis submitted toward the degree of Master of Science in Electrical and Electronic Engineering

By

Avishay Orpaz

This research was carried out in the Department of Electrical Engineering – Systems under the supervision of Dr. Shlomo Weiss

July 2004


To my wife Idit, who spent on this work as much time as I did


Contents

1 Introduction and Related Work
  1.1 Goals and Objectives
  1.2 Introduction
    1.2.1 Code Compression in Embedded Systems
      1.2.1.1 Motivation
      1.2.1.2 Related Work
    1.2.2 Compressed Pattern Matching
      1.2.2.1 Motivation
      1.2.2.2 Pattern Matching
      1.2.2.3 Text Compression
      1.2.2.4 Compressed Matching Problem Definition
      1.2.2.5 Related Work
  1.3 Outline of the proposed algorithm
  1.4 Structure of the work

2 Class Compression
  2.1 The Class Compression System
  2.2 Analysis
    2.2.1 Compression Ratio
    2.2.2 Redundancy
  2.3 The optimal class structure problem
    2.3.1 Brute force approach
    2.3.2 A faster algorithm
    2.3.3 The Optimization Algorithm and the Prefix
    2.3.4 Complexity
    2.3.5 Example
    2.3.6 Implementation
  2.4 Experimental results
    2.4.1 Implementation Issues
    2.4.2 Results

3 Code Compression for Embedded Systems
  3.1 System Parameters
    3.1.1 Probability Model
    3.1.2 Codebook Limit
    3.1.3 Static vs. Dynamic Structure
  3.2 Experimental Results

4 Multi-resolution string matching
  4.1 Low resolution text
  4.2 A string matching algorithm
    4.2.1 Definition
    4.2.2 Details
    4.2.3 Double-byte symbols
  4.3 Analysis
    4.3.1 Relation of number of classes to search speed
  4.4 Extra space
  4.5 Experimental Results
    4.5.1 Implementation Notes
    4.5.2 Results
  4.6 Enhancement Possibilities

5 Summary

List of Figures

1.1 Computing System with Code Compression
2.1 Entropy Coder
2.2 Class Structure Optimization Example
2.3 Compression ratio for some files from the testset with different number of classes
2.4 Approaching the Huffman limit (X axis is logarithmic for convenience)
3.1 Class Structure of CodePack
3.2 Comparison of Probability Models for Code Compression
3.3 Comparison of Dictionary Sizes for Different Models
3.4 The Effect of Codebook Limit (with varying N)
3.5 The Effect of Codebook Limit (fixed N)
3.6 Static Class Configuration
4.1 Multiple resolution text using class coding
4.2 Class-compressed text stream format
4.3 Illustration of string matching procedure
4.4 Class coding compared with other compression algorithms
4.5 False matches results
4.6 Search time results
4.7 Search time versus false matches
4.8 Optimal selection of N
4.9 Comparison of Search Times

List of Tables

1.1 Taxonomy of compressed matching methods
2.1 Symbol Probabilities for the example
2.2 Class Assignment for the example
2.3 Final Symbol Coding
2.4 Datasets for compression experiments

Abstract

This work presents a coding system called class coding and two of its applications. The class coding system is a compression scheme with some unique features. The system is analyzed, and a novel algorithm is presented to optimally adjust its parameters. The first application of this system is code compression; the work explores the parameters of the system that are relevant to such usage. The second application is compressed matching – a class of algorithms designed to search for a text within another, compressed, text without having to completely decompress it. The features of class coding are used to provide a fast, flexible and efficient search algorithm.


Chapter 1

Introduction and Related Work

1.1 Goals and Objectives

This work deals with two subjects: code compression in embedded systems and compressed pattern matching. The common denominator of these usually unrelated problems is a compression method called class compression. The objectives of the work are as follows:

• Study the class compression method and its properties, and develop an algorithm to optimize it (in the sense of compression ratio).

• Explore the use of class compression for the application of code compression in embedded systems. Parameters that influence the system performance, and the tradeoffs among them, will be introduced and studied.

• Develop a method to perform pattern matching in class-compressed files, making use of their special properties to achieve performance benefits.


1.2 Introduction

1.2.1 Code Compression in Embedded Systems

1.2.1.1 Motivation

Driven by an expanding market for consumer electronics and communications equipment, embedded software is becoming increasingly complex [37, 38]. In high-end embedded products, 32-bit microprocessor cores provide the computing power needed to run complex algorithms in real time. Timely development of such complex and large embedded applications requires the use of high-level languages and compilers instead of manually crafted assembly code. From the system point of view, larger applications and compiled code are both factors that add up to a requirement for larger instruction memory.

Another factor is the system software. Currently, many embedded products use real-time operating systems with modest memory requirements, typically in the range of 10KB to 100KB. Embedded versions of Linux and Windows are becoming increasingly popular [50] in high-end 32-bit applications and in products that do not have tight real-time requirements, such as set-top boxes and networked game consoles. Scaled-down versions of Linux or Windows NT may require a few megabytes of memory.

The RISC architecture, which is now dominant in embedded systems (practically in all but computers based on the old x86 architecture), uses fixed-length instruction words to enable fast decoding and branching. Unlike variable-length instruction word architectures, the fixed-length approach sacrifices code size for speed (because some instructions include unused bits).


This is usually not a concern in PCs and other non-embedded systems, where memory is in abundance. In embedded systems the picture is different. These systems are typically subject to space, power and cost constraints, in which instruction memory, as any other resource, is limited. The available instruction memory space may be better utilized by encoding embedded software in a compact format. This cost-effective approach has received a lot of attention in recent years. The concept is to compress the instructions in memory (using one of the many available compression methods or new ones), thereby saving memory, while the decompression is performed in real time, just before the code is executed. This function is usually performed by a special hardware unit. Not every compression method is suitable for such an application. The system must be capable of decompressing the code quickly, affecting performance as little as possible, and handling the compressed address space efficiently. All these mechanisms are usually required to work transparently with respect to the software.

1.2.1.2 Related Work

Several approaches have been used to produce compact code. Thumb [56] and MIPS-16 [28] offer extended instruction sets that include short instructions for embedded applications. The use of short instructions adds minimal run-time overhead. This is not a compression method per se, but code size is reduced. The major drawback of this technique is that code has to be recompiled (if it is hand written in assembly, it has to be rewritten). Moreover, all the development tools must be modified to utilize the new instructions. This method is also very processor-specific: it cannot be ported to another processor without being almost completely redesigned.


A second approach is to reduce the size of the compiled code by generating a custom instruction set, matched to the characteristics of the compiled program [18, 22]. The custom instructions are interpreted at a speed slower by a factor of 2-4 relative to compiler-generated code [19], or a tailored decoder may produce the processor's internal signals directly [34].

The third approach, illustrated in figure 1.1, takes the binary image of regular code, generated by the regular development tools, and compresses it. This compressed code cannot, of course, be executed on its own – it has to be decompressed by some hardware unit in the target system. Several implementations of this approach have been suggested. Chen [13, 12] describes a dictionary-based system, in which common code segments are replaced by a single reference. At runtime, the decompression core retrieves these code segments back for execution. Another paper based on the same method is by Lefurgy [39]. Other researchers tried other compression algorithms. Kozuch [32, 33] studied the statistical properties of RISC code using 0-order and 1-order models. Ernst [16] uses a method called operand factorization along with a Markov probability model. Lekatsas [40] and Xie [64] use arithmetic coding (reduced to accommodate the fast decoding requirement); Liao [41] uses a dictionary-based method called EPM; Wolfe and Chanin [62] use Huffman-based methods. Finally, IBM's CodePack system [20] is of special interest to our work. It will be described in detail in a later chapter.

Another important aspect of code compression systems is the address mapping problem. After compression, the addresses of some instructions change. Branch instructions specify a destination address, which was calculated by the compiler according to the uncompressed code.


"% %

! "

#

"

$

%""

"% %

" & '

! "

"

#

$

%""

Figure 1.1: Computing System with Code Compression (a) Hardware configuration; (b) Hardware configuration with code compression; (c) Software development workflow; (d) Software development workflow with code compression

After compression, this address may no longer be correct. It may also be possible that the new address does not even fall on a byte boundary. Two common methods exist to overcome this problem. The first (introduced in [62]) uses a line address table, which is used to translate at runtime from the uncompressed address space to the compressed address space. The second approach, introduced in [39], scans the compressed code and patches all the branch instructions to their correct destinations.

1.2.2 Compressed Pattern Matching

1.2.2.1 Motivation

The problem of pattern matching is a fundamental problem in computer science, concerned with finding the first (or all) occurrences of a string within another, larger string. In this work we address an instance of the string matching problem usually known as the compressed matching problem. We are interested in finding an occurrence of a string within another string, but in this case the second string is compressed by some compression method. We want to do that without having to decompress the large string first. It is also desirable that algorithms supporting such an operation do it faster than the naive solution and, of course, afford as much compression as possible. This problem is becoming more and more important due to the vast amount of information modern servers and personal computers have to deal with.

1.2.2.2 Pattern Matching

The problem of pattern matching is formalized as follows (from [54]): given a pattern string x, with |x| = m, and a text string y, with |y| = n, where m, n ≥ 0 and m ≤ n, if x occurs as a substring of y then determine the position within y of the first occurrence of x, i.e. return the least value of i such that y(i, i + m − 1) = x(1, m).

The simplest form of string matching is the brute force approach. This algorithm simply scans y and compares, at every position, the characters of y with the characters of x. This algorithm has two drawbacks: first, it requires the ability to go back in y; and second, it has a worst-case time complexity of O(m · n). Two algorithms that overcome these problems are Knuth-Morris-Pratt [31] and Boyer-Moore [7]. Both have a worst-case time complexity of O(m + n), but the latter achieves sublinear time in the average case, which makes it extremely important and useful. It is also important to note here a fundamental result by Rivest [47], who showed that in the worst case no algorithm can perform string matching with fewer than n − m + 1 comparisons, thus making a sublinear-in-the-worst-case algorithm impossible.

1.2.2.3 Text Compression

The goal of text compression algorithms is to encode a given string (text) in such a way that it consumes as little storage as possible. Certainly, the encoding must be reversible: for a given encoding algorithm C, there exists a decoding algorithm D such that t = D(C(t)).


This problem has been extensively studied, and the two most common coding families today are entropy coding and dictionary coding. In entropy coding, the goal is to find the most suitable code for every symbol, such that the total length of the encoded text is minimized. Some members of this family are Huffman [23] and arithmetic coding. Dictionary methods, in contrast, try to replace frequently occurring substrings with references to their previous occurrences. Some members are Lempel-Ziv and its variants. In recent years, some newer methods have emerged such as antidictionaries [14], byte pair encoding [14] and block sorting [8]. The performance of a compression algorithm is measured by two factors: the compression it can attain and the speed of compression/decompression.

1.2.2.4 Compressed Matching Problem Definition

The methods described in the previous section do achieve their goals, but the encoded text, in the general case, cannot be searched for a pattern using the algorithms described in 1.2.2.2. To achieve compression, the compression algorithms change the structure, the symbol encoding and the order of the text, thus making the direct application of the plain-text matching algorithms impossible. Searching compressed texts is possible in one of two ways: 1) decompressing the file and then employing a known searching algorithm; or 2) using a new algorithm designed specifically for compressed pattern matching. The first approach has some drawbacks. Decompressing a file can consume, in some cases, considerable time and space. Moreover, in certain cases it is desirable to search a text repeatedly for different patterns; in the direct method we will have to decompress the same text over and over again.


In some other cases, we are not interested in the entire file, but only in the area where the desired string has been found. Again, we will have to decompress the entire text. The second approach has the potential of overcoming all these drawbacks if a good algorithm is found. Amir and Benson [1] were the first to define this problem and give performance goals (in the following definitions u is the length of the uncompressed text, n is the length of the compressed text and m is the length of the pattern being searched):

• An algorithm that finds all the occurrences of the pattern in the text in O(u) time is efficient.

• An algorithm that finds all the occurrences of the pattern in the text in O(n log m + m) time is almost optimal.

• An algorithm that finds all the occurrences of the pattern in the text in O(n + m) time is optimal.

In a later paper, Amir, Benson and Farach [2] observe that there is sometimes a tradeoff between the execution speed and the extra space consumed. The amount of extra space used is, therefore, another important performance measure.

1.2.2.5 Related Work

The work on this subject can be coarsely divided by the type of algorithm used to compress the text. Table 1.1 presents this taxonomy.


Run-length   Amir and Benson [1], Eilam-Tzoreff and Vishkin [15], Amir, Landau and Vishkin [3]
Huffman      De Moura et al. [44], Manber [42], Klein and Shapira [30]
LZ*          Kida et al. [26], Kida et al. [27], Amir, Benson and Farach [2], Klein and Shapira [29], Farach and Thorup [17], Gąsieniec and Rytter [21]
Other        Shibata et al. [51], Shibata et al. [52], Takeda [55], Mitarai et al. [43], Tarhio et al. [46]

Table 1.1: Taxonomy of compressed matching methods

The first group (also chronologically) of methods is based on run-length encoding. The most salient members are Eilam-Tzoreff and Vishkin [15], who studied the problem in one dimension, and Amir, Landau and Vishkin [3] and Amir and Benson [1], who studied the problem in two dimensions. The last one, as already mentioned, is the work that formulated and defined the problem of compressed matching.

The second group is based on Huffman [23] encoding. De Moura et al. [44] suggested a method based on compressing complete words, rather than characters, using Huffman code. A special mark bit is used to denote a word beginning, allowing fast searches (thereby reducing compression levels). Klein [30] takes an interesting approach and uses a probabilistic search method that relies on the resynchronization property of Huffman codes.

The third group is based on the Lempel-Ziv family of compression algorithms. It should not come as a surprise that this is the largest group of all, since the Lempel-Ziv algorithms are extensively employed to compress text, with very good results.


Some algorithms in this group try to perform the search on files compressed by standard tools, particularly the Unix compress tool, which uses LZW [60]. Two examples of such an approach are Amir, Benson and Farach [2] and Kida et al. [26], who try to make use of the efficient Shift-And algorithm for searching in (uncompressed) text. Other authors change the base encoding scheme to facilitate the search. Klein [29], for example, replaces the back pointers in the LZSS scheme by forward pointers.

The fourth and final group contains other kinds of methods. Manber [42] designed a system in which the 128 unused codes in regular texts (assuming ASCII encoding) are utilized to encode common character pairs. The author describes a method for selecting these pairs in such a way that no ambiguity will be present. Shibata et al. [51] use a hierarchical method in which every stage consists of removing common symbol pairs from the text and replacing them by a new symbol. Another scheme by Shibata [53] searches for patterns in text compressed using antidictionaries (which are lists of forbidden phrases, see Crochemore [14]). Two recent results are due to Tarhio [46], which generalizes the method proposed by De Moura [44], and Shibata et al. [52], who use the Boyer-Moore algorithm to search through BPE-compressed text.

1.3 Outline of the proposed algorithm

We propose an algorithm that compresses text (over some arbitrary alphabet Σ) while allowing search directly in the compressed text. The basic compression algorithm is inspired by CodePack, an algorithm used by IBM for code compression in their PowerPC processors [20].


It is a zero-order substitution scheme, in which every symbol is replaced by a variable-length codeword. The codeword is comprised of two parts. The first part selects a specific subset of the alphabet, and the second part selects a symbol within this subset, thereby resolving the original symbol. Looking at the first part of all the codewords of the text, we get a low-resolution image of the text. This image has certain properties that allow us to search it very fast and to filter out many locations in which the pattern being searched cannot occur. We present an algorithm that optimally partitions the alphabet into subsets, and then we prove that searching the low-resolution image can indeed reduce the need to completely decompress the text and search in it. All the algorithms have been implemented, and experimental results are given. The algorithms described here were previously published in [4] and [5].

1.4 Structure of the work

In the first chapter, we give a general overview of the two problems we are dealing with – code compression in embedded systems and compressed pattern matching. We review the work that has been done in these research areas along with some recent results. The second chapter deals with the compression method that we propose – class compression. After explaining the scheme, we develop an algorithm for optimizing the compression and discuss the influence of various parameters on the compression rate.


The third chapter is devoted to code compression. We describe the usage of class compression in this application and explore the parameters influencing the system performance. The fourth chapter describes the compressed pattern matching application; the algorithm is described and analyzed, and experimental results are given. The fifth and final chapter summarizes the work.


Chapter 2

Class Compression

2.1 The Class Compression System

The class compression system, used by CodePack [20, 25] and previously used by Said and Pearlman [49] (with some variations), is a 0-order entropy coder. Each symbol in the original text is replaced by some codeword. The length (in bits) of the input symbols is fixed, but the length of the output codeword is variable. By tuning the codeword lengths according to the symbols' probabilities of occurrence, compression can be attained. The best-known coder of this kind is the Huffman [23] code, which describes a way to optimally select the length and the bits of the codeword for every symbol. The Huffman code possesses the prefix property, which means that no codeword is a prefix of another codeword; that way, the code can be decoded without ambiguity.

The class compression system takes a similar approach. Let Σ = {σ_1, σ_2, ..., σ_K} be some arbitrary alphabet. Let {C_1, C_2, ..., C_{N+1}} be a set of N + 1 proper subsets of Σ, such that


Figure 2.1: Entropy Coder

\Sigma = \bigcup_{j=1}^{N+1} C_j \qquad \text{(Cover)} \qquad (2.1)

C_i \cap C_j = \emptyset \quad \forall\, i \neq j \qquad \text{(Uniqueness)} \qquad (2.2)

Each C_j is assigned a codeword that identifies it, and then another codeword identifies every member within the selected subset. The first codeword is called the prefix and the latter is the index. Every symbol in the alphabet is represented by a prefix-index pair, and from (2.1) and (2.2) it is clear that every symbol belongs to one and only one subset (called a class from now on); thus the process can be reversed and the original text can be extracted from the codewords. The codewords for the prefix and for the index are allocated differently. The prefix uses Huffman codes. The index uses a fixed-length binary code – every class has an associated index length (which may differ among classes). The length of that code must be selected according to the cardinality of the class, in order to provide a representation for every member. More specifically, the length of the index of class C_j is not less than ⌈log₂|C_j|⌉.


In order to decompress a stream of class-coded text, another piece of information is required – a codebook is necessary to translate between the codewords and the original symbols. However, including all the symbols in the codebook may not be economical, since some appear very infrequently; some may even appear only once. To solve this problem, the literal class is added to the system. The members of the literal class are all the symbols that are not members of any other class. It has a prefix exactly like any other class, but instead of an index, the original symbol is copied verbatim. Therefore, it is not necessary to include these symbols in the codebook. The drawback here is that the compressed codeword is longer than the uncompressed symbol, but since such symbols are infrequent, they have little influence on the overall length of the compressed text. This situation also occurs in other entropy coders.
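To make the scheme concrete, the following is a minimal sketch of the prefix-index encoding in Python. The class structure, the tiny alphabet and the prefix codes below are made-up examples (in practice the prefixes would come from Huffman's algorithm applied to the class probabilities, and the class structure from the optimization of section 2.3):

    # Each class: (prefix bits, member symbols); the last class is the
    # literal class, whose "index" is the uncompressed symbol itself.
    CLASSES = [
        ("00", ["e", "t"]),            # 2 members -> 1-bit index
        ("01", ["a", "o", "i", "n"]),  # 4 members -> 2-bit index
        ("1",  None),                  # literal class: symbol copied verbatim
    ]
    SYMBOL_BITS = 8                    # B, bits of one uncompressed symbol

    def encode_symbol(sym):
        """Return the (prefix, index) bit strings of one symbol."""
        for prefix, members in CLASSES:
            if members is None:                        # literal class
                return prefix, format(ord(sym), f"0{SYMBOL_BITS}b")
            if sym in members:
                index_len = (len(members) - 1).bit_length()   # ceil(log2 |C|)
                index = format(members.index(sym), f"0{index_len}b") if index_len else ""
                return prefix, index
        raise ValueError(f"symbol {sym!r} not covered")

    pairs = [encode_symbol(s) for s in "cotton"]
    low_res = "".join(p for p, _ in pairs)       # prefixes only (used in chapter 4)
    stream  = "".join(p + i for p, i in pairs)   # the class-coded bit stream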

2.2 Analysis

2.2.1 Compression Ratio

Let

\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_K\} \qquad (2.3)

be an alphabet, with every symbol σ_i having an associated occurrence probability p_i. Without loss of generality, we can assume that the symbols are sorted by their probability, that is p_1 ≥ p_2 ≥ ... ≥ p_K.


Let

C = \{C_1, C_2, \ldots, C_N, C_{N+1}\} \qquad (2.4)

be a set of classes, with C_{N+1} being the literal class. A class configuration is a mapping

G : \Sigma \to C \qquad (2.5)

A class configuration shall be called valid if it complies with (2.1), (2.2) and one more rule:

\sigma_i \in C_j \;\Rightarrow\; \sigma_{i+1} \in C_{j'}, \quad j' = j \text{ or } j + 1 \qquad (2.6)

The compression ratio¹ that will be achieved under such a parameter system can be obtained by counting the bits in the individual components of the code:

R = \frac{\overbrace{\sum_{i=1}^{N} P_i \cdot (\pi_i + \log_2 |C_i|)}^{\text{I}} \;+\; \overbrace{(B + \pi_{N+1}) \cdot P_{N+1}}^{\text{II}} \;+\; \overbrace{B \cdot \sum_{i=1}^{N} |C_i|}^{\text{III}}}{B} \qquad (2.7)

where B is the word length of each symbol in the uncompressed text, π_i is the prefix length of class C_i, and P_i is the probability that a symbol belongs to C_i:

P_i = \sum_{j : \sigma_j \in C_i} p_j \qquad (2.8)

¹ The literature defines the compression ratio in two different ways: |C(T)|/|T| and 1 − |C(T)|/|T|, where |T| is the length of the text and |C(T)| is the length of the compressed text. We adhere to the first form, meaning that the smaller the number, the better the compression.

Equation (2.7) has three components. The first component (designated I) represents all the symbols that are encoded in all the classes except for the literal class. Every such symbol is encoded by a prefix, whose length (in bits) for class C_i is π_i, and an index whose length is ⌈log₂|C_i|⌉. The second part, designated II, counts the bits in the symbols included in the literal class; their codes include the prefix of the literal class and the uncompressed codeword. The last part, III, is the space required by the codebook, which has one entry of the length of the uncompressed codeword for every symbol that does not belong to the literal class.

Several factors govern the length of the codeword stream for a given text (the compression ratio):

• Number of classes – To understand the influence of the number of classes on the compression ratio, we look at two extreme cases. The first case is N = 0 (there is only one class – the literal class). Since there is only one class, its prefix is 0 bits long and we have only the index. The index in the literal class, however, is a copy of the original symbol, so in this case we have |C(T)| = |T|. The second case is N = |Σ|. In this case every symbol in Σ has its own class with a 0-bit index. The prefix is Huffman coded, so every symbol is effectively Huffman coded; in this case |C(T)| = |H̃(T)| (where H̃(T) is the Huffman coding of T).

  Between these two extremes, the compression ratio improves monotonically. To prove that, let us assume that we have a class structure with N classes over some alphabet. We build a new class structure with N + 1 classes using the following procedure. First, we pick an arbitrary class having two symbols or more in it. Then, we split it into two classes, each containing exactly half the symbols of the original class (always possible because class cardinality is an integral power of 2). The number of bits required to represent each symbol in each new class is one bit less than was required by the original class. The prefixes of the two new classes would be the prefix of the original class appended by a single bit, selecting one of the new classes. The new structure obtained is clearly valid, has N + 1 classes as required, and uses exactly the same number of bits. This structure is not necessarily optimal, meaning that there might be some other structure that uses fewer bits (but not more). Therefore, the optimal compression ratio, as a function of the number of classes, is monotonic (non-increasing). To sum up, the more classes we allow, the better the compression we expect to achieve.

• Class structure – In its documentation ([24, 25, 20]) IBM doesn't reveal how it has determined its class structure. This is the main issue dealt with here, and an optimal class structure optimization will be presented in the next section.

• Model – The model used to decide what the symbols of the text are and what the probability of every symbol is. Two choices are possible with class compression. The first is a static model, built using a-priori knowledge of the probabilistic properties of the text being compressed; the second is a semi-static model, in which the text is scanned prior to compression and the occurrences of each symbol are counted. We will restrict ourselves to zero-order models, which means that a symbol is always replaced by a unique codeword, regardless of its position in the text or its context. Class coding can be used with higher-order models by augmenting the alphabet appropriately.
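As an illustration of how (2.7) can be evaluated for a candidate configuration, the following sketch computes the ratio |C(T)|/|T| from symbol occurrence counts rather than probabilities, so that all three terms are actual bits; the prefix lengths π_i are assumed to be given. This is not code from the thesis, only a restatement of the formula:

    from math import ceil, log2

    def compression_ratio(classes, prefix_lens, B):
        """Evaluate equation (2.7).
        classes     -- list of N+1 lists; each inner list holds the occurrence
                       counts of the symbols assigned to that class, the last
                       list being the literal class
        prefix_lens -- pi_i, the prefix length (bits) of each class
        B           -- word length (bits) of an uncompressed symbol
        """
        total_symbols = sum(sum(c) for c in classes)
        bits = 0
        for i, (counts, pi) in enumerate(zip(classes, prefix_lens)):
            occurrences = sum(counts)
            if i == len(classes) - 1:                  # literal class, term II
                bits += occurrences * (pi + B)
            else:                                      # ordinary class, term I
                index_len = ceil(log2(len(counts))) if len(counts) > 1 else 0
                bits += occurrences * (pi + index_len)
                bits += B * len(counts)                # codebook entries, term III
        return bits / (total_symbols * B)              # |C(T)| / |T|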

2.2.2 Redundancy

We start our analysis following [49]. Our model is an i.i.d. source of symbols σ_i, each having a probability p_i. The entropy of the source is:

H = -\sum_{i=1}^{K} p_i \log_2 p_i \qquad (2.9)

After separating index and prefix, the entropy would be:

H = -\sum_{i=1}^{N} P_i \log_2 P_i - \sum_{i=1}^{N} P_i \sum_{j \in C_i} \frac{p_j}{P_i} \log_2 \frac{p_j}{P_i}
  = -\sum_{i=1}^{N} P_i \log_2 P_i - \sum_{i=1}^{N} \sum_{j \in C_i} p_j \log_2 \frac{p_j}{P_i} \qquad (2.10)

In our coding scheme, the indices are not entropy coded, but have a fixed length. Thus the second additive term in (2.10) changes:

H' = -\sum_{i=1}^{N} P_i \log_2 P_i - \sum_{i=1}^{N} \sum_{j \in C_i} p_j \log_2 \frac{1}{|C_i|} \qquad (2.11)

Taking the difference between (2.10) and (2.11) we get:

\Delta H = H' - H = \sum_{i=1}^{N} \sum_{j \in C_i} p_j \log_2 \frac{p_j \cdot |C_i|}{P_i} \qquad (2.12)

which leads us to an important conclusion: losses due to the fixed length of the index are minimized if p_i ≈ P_j / |C_j| for σ_i ∈ C_j.
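A short sketch of (2.12), useful for checking how much a given partition loses to the fixed-length indices (illustrative code, not from the thesis):

    from math import log2

    def index_loss(classes):
        """Delta H of equation (2.12); classes is a list of lists of symbol
        probabilities, one list per non-literal class."""
        loss = 0.0
        for members in classes:
            P = sum(members)
            for p in members:
                loss += p * log2(p * len(members) / P)
        return loss

    # the first class costs nothing: its symbols are equiprobable
    print(index_loss([[0.3, 0.3], [0.2, 0.1, 0.05, 0.05]]))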

2.3 The optimal class structure problem

As shown in the last section, the class configuration has a major impact on the compression ratio. In this section we investigate this problem and propose an efficient algorithm to find such a mapping.

2.3.1 Brute force approach

Our goal is to assign K symbols to N + 1 classes under the constraints (2.1), (2.2) and (2.6). The number of members of each class (with the exception of the literal class) is an integral power of 2, therefore it is possible to write:

|C_i| = 2^{z_i}, \quad z_i \text{ integer}, \; 1 \le i \le N \qquad (2.13)

and from (2.1) and (2.2) we get

\sum_{i=1}^{N} |C_i| \le |\Sigma| \qquad (2.14)

Substituting (2.13) in (2.14),

\sum_{i=1}^{N} 2^{z_i} \le K \qquad (2.15)

The problem now is how many vectors Z = (z_1, z_2, ..., z_N) satisfy (2.15). To bound this, let us assume z_1 = z_2 = ... = z_N = z. Now (2.15) becomes

N \cdot 2^{z} \le K \qquad (2.16)

and therefore

z \le \log_2(K/N) = \log_2 K - \log_2 N \approx \log_2 K \qquad (2.17)

and every Z with z_n ≤ z satisfies (2.15). The last approximation is due to the fact that in real-world applications K ≫ N. This is evidently a lower bound, since there might be vectors Z in which some component is greater than z that still satisfy (2.15). Finally, a brute-force scheme would have to check at least

z^N = (\log_2 K)^N \approx O((\log K)^N) \qquad (2.18)

vectors. In practice (as we shall see later), N is rather small, so a brute-force search is feasible. Nevertheless, it is desirable to find an algorithm that can solve the problem in much shorter time (both from a complexity point of view and in practice).
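For intuition about the size of this search space, the sketch below enumerates the candidate class-size vectors a brute-force optimizer would have to score with (2.7). It is illustrative only; the bound above assumes K ≫ N:

    from itertools import product
    from math import log2

    def candidate_vectors(K, N):
        """Yield every Z = (z_1, ..., z_N) with sum(2**z_i) <= K."""
        max_z = int(log2(K))                 # z <= log2(K/N) <= log2(K)
        for Z in product(range(max_z + 1), repeat=N):
            if sum(2 ** z for z in Z) <= K:
                yield Z

    print(sum(1 for _ in candidate_vectors(K=256, N=4)))   # grows like (log K)^N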

2.3.2 A faster algorithm

We introduce a graph G = (V, A) having K + 1 nodes – one node corresponding to each symbol in Σ and one additional, final node:

V = \{ v_k \mid 1 \le k \le K + 1 \} \qquad (2.19)

The nodes are connected by arcs of two types – final and non-final:

A = A_{NF} \cup A_{F} \qquad (2.20)

Non-final arcs are defined as:

A_{NF} = \{ a_{i,j} \mid i < j \le K \text{ and } (j - i) \text{ is an integral power of } 2 \} \qquad (2.21)

Final arcs are defined as:

A_{F} = \{ a_{i,j} \mid j = K + 1 \} \qquad (2.22)

Each arc in the graph corresponds to a feasible class – a class that may be selected to be included in C. A non-final arc extending from v_i to v_j represents a class in which the symbols σ_i, σ_{i+1}, ..., σ_{j−1} are included. A final arc extending from v_i to v_{K+1} (note that a final arc always ends at the final node) represents a literal class containing the symbols from σ_i to the end of the alphabet. Following these definitions, we look at a path from v_1 to v_{K+1}

P = a_{1,z_1} \rightarrow a_{z_1,z_2} \rightarrow \cdots \rightarrow a_{z_N,K+1} \qquad (2.23)

such that

1 < z_1 < z_2 < \cdots < z_N \qquad (2.24)

Such a path defines a class configuration:

C(P) = \{C_1, C_2, \ldots, C_N, C_{N+1}\}, \quad \text{where} \quad
\sigma_1, \ldots, \sigma_{z_1 - 1} \in C_1; \;\;
\sigma_{z_1}, \ldots, \sigma_{z_2 - 1} \in C_2; \;\; \ldots; \;\;
\sigma_{z_\ell}, \ldots, \sigma_{z_{\ell+1} - 1} \in C_{\ell+1}; \;\; \ldots; \;\;
\sigma_{z_N}, \ldots, \sigma_K \in C_{N+1} \qquad (2.25)

We now claim that a class structure (2.25), corresponding to a path of the form (2.23) over V, is valid. In order to prove that C(P) is valid, it is necessary to show that it satisfies (2.1), (2.2) and (2.6). The uniqueness and cover properties are a direct corollary of (2.24): for any σ_i there exists one and only one ℓ such that z_ℓ ≤ i < z_{ℓ+1} (taking z_0 = 1 and z_{N+1} = K + 1), and from (2.25) we conclude that σ_i belongs to exactly one class, C_{ℓ+1}. To prove that (2.25) satisfies (2.6), let us assume σ_i, σ_j ∈ C_{ℓ+1} (i < j). This means that z_ℓ ≤ i < j ≤ z_{ℓ+1} − 1, and then for any σ_k with i < k < j we also have σ_k ∈ C_{ℓ+1}.

Each arc has a weight:

w_{i,j} =
\begin{cases}
\left( \sum_{\ell=i}^{j-1} p_\ell \right) \cdot \log_2(j - i) + (j - i) \cdot B & \text{if } a_{i,j} \in A_{NF} \\
\left( \sum_{\ell=i}^{K} p_\ell \right) \cdot B & \text{if } a_{i,j} \in A_{F}
\end{cases} \qquad (2.26)

The weight is taken directly from (2.7). The weight of non-final arcs is taken from parts I and III, while the weight of final arcs is taken from part II. The weight of an arc is the average number of bits per symbol² that such a class would have required in the compressed text. Note that this weight includes the number of bits required in the codebook. The arcs that extend from any node v_i to the final node correspond to a literal class that begins at symbol σ_i.

Comparing (2.26) and (2.7), one can notice that the terms π_k are absent from the former. These terms are related to the prefix bits. A key property of (2.26) is that the weight of a certain class depends only on its size and content (the probabilities of the symbols in it) and is absolutely independent of any other class. Including the prefixes changes this situation. The prefixes are determined by applying Huffman's algorithm [23] to the weights of all the classes (2.8). Therefore, knowledge of all the classes is required to find even a single class prefix length. We claim, however, that the influence of ignoring the prefix length is minor. This claim shall be left as a heuristic, since it cannot be proved exactly; only a general reason will be given.

² When using the symbol probabilities p_i as a measure, (2.26) gives the average bits per symbol. In practice, it is much more efficient to make these calculations in terms of actual bits, thereby eliminating costly floating-point calculations. This can be achieved easily by multiplying all the probabilities by the text length, making them count the actual number of occurrences of each symbol in the text. This step has no influence when seeking a minimum or maximum.


If, instead of Huffman codes, we had used another code for the prefix, one that achieves the entropy bound, it would be possible to completely separate the problem of the index and the prefix (as in (2.10)). Huffman codes have some redundancy over the entropy, so this statement does not hold, and it is also very difficult to bound this redundancy based on a single symbol probability [61]. Finally, we can write the optimization problem as a graph-theory problem: given a graph G = (V, A) with V and A as defined, find the shortest path from v_1 to v_{K+1} using at most N + 1 arcs.

Solving this problem requires only a small modification of the well-known Bellman-Ford shortest path algorithm (see, for example, [36]). In this algorithm, the j-th iteration finds the shortest paths from the origin to all the other nodes using at most j arcs. Usually, this process continues until no improvement can be made to the path, or until all the arcs have been tested. In our case, we simply stop the algorithm after N + 1 iterations and take the shortest path found between the origin and the final node. Having the path, we translate it into a class configuration using (2.25), which is now guaranteed to be both valid and optimal in the sense of compression ratio.
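The following sketch builds the arcs and weights of (2.21), (2.22) and (2.26) for an alphabet already sorted by decreasing probability. It is an illustration under the stated definitions, not the thesis implementation; p may hold probabilities or raw counts (see footnote 2):

    def build_arcs(p, B):
        """Return a dict mapping an arc (i, j) to its weight w_{i,j}.
        Nodes are numbered 1..K+1 as in the text; p[0..K-1] are the symbol
        probabilities (or counts), B is the uncompressed symbol width in bits."""
        K = len(p)
        prefix = [0.0] * (K + 1)                 # prefix[k] = p_1 + ... + p_k
        for k in range(1, K + 1):
            prefix[k] = prefix[k - 1] + p[k - 1]
        arcs = {}
        for i in range(1, K + 1):
            size = 1
            while i + size <= K:                 # non-final arcs: power-of-two classes
                j = i + size
                mass = prefix[j - 1] - prefix[i - 1]
                arcs[(i, j)] = mass * (size.bit_length() - 1) + size * B
                size *= 2
            arcs[(i, K + 1)] = (prefix[K] - prefix[i - 1]) * B   # final (literal) arc
        return arcs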

2.3.3 The Optimization Algorithm and the Prefix

As already mentioned, the prefix was not included in the optimization problem. In this section we will explain why it can be approximately ignored. First, we will explain why it is impossible to include the prefix in the optimization method described in the last section.


In order to know the length of the prefix required for a certain class, it is not enough to know which symbols belong to that class, as is the case with the index. To correctly compute the prefix, we need to know the weights of all the classes. This fact makes it impossible to accurately include the length of the prefix in any algorithm that finds the optimal solution in a step-by-step manner. Moreover, the relationship between symbol probabilities and their code lengths (using Huffman encoding) is highly nonlinear ([65]), which renders numerical optimization methods ineffective. However, we claim that under certain conditions, ignoring the prefix has a negligible effect. Let ρ be the redundancy of the prefix code. We write equation (2.7), but instead of using the π_i notation for the bit length of each prefix, we add the entropy of the prefixes:

R' = \frac{\sum_{i=1}^{N} P_i \log_2 |C_i| + P_{N+1} \cdot B + B \cdot \sum_{i=1}^{N} |C_i| + (H_{\text{prefix}} + \rho)}{B} \qquad (2.27)

where

H_{\text{prefix}} = -\sum_{i=1}^{N+1} P_i \cdot \log_2 P_i \qquad (2.28)

Expanding H_{\text{prefix}} we get

R = \frac{\sum_{i=1}^{N} P_i (\log_2 |C_i| - \log_2 P_i) + P_{N+1} (B - \log_2 P_{N+1}) + B \cdot \sum_{i=1}^{N} |C_i|}{B} + \frac{\rho}{B} \qquad (2.29)


The first part of the equation has the form of equation (2.7), thus we can write:

R = \tilde{R} + \frac{\rho}{B} \qquad (2.30)

\tilde{R} is the "ideal" compression rate – when the prefix coding hits the entropy bound. ρ, as defined, is the redundancy of the Huffman code. A basic property of such a code is ρ ≤ 1 ([65]). This gives us a loose, but absolute, bound on the redundancy (due to the prefix only) of class coding using Huffman-coded prefixes:

\rho_{\text{class}} \le \frac{1}{B} \qquad (2.31)

We want to improve this bound further, but here we will need to add some assumptions. The extensive research on the redundancy of the Huffman code reveals that if the probability of the most probable symbol is below 0.5, the upper bound drops dramatically ([65]). In natural languages, the probability of the most frequent symbol is well below that threshold (see, for example, the English language statistics in [6]). Moreover, the cumulative probability of several of the most frequent characters is still much smaller than 0.5. Remembering the tendency of the optimization algorithm to gather symbols with similar probabilities leads us to the assumption that in such applications the redundancy is negligible and the optimization algorithm can safely ignore the prefix.


2.3.4 Complexity

The complexity of the general Bellman-Ford algorithm is O(K^3). In our case, however, we make at most N iterations, thus reducing the complexity to O(N · K^2). Inspecting the structure of the arcs in the graph carefully, we find that the number of arcs going into node v_k is at most log₂ K, so by removing unnecessary comparisons the time complexity can be reduced even further, to O(N · K · log₂ K). The space complexity will be discussed later; it is O(N · K).

2.3.5 Example

As an example we will use the following text:

HDFCABACDFCCEEDCEBCBACDGDCCACACCEAEEE

The statistics of this text are easily extracted (see Table 2.1). We are interested in coding this text using 2 classes (N = 2). Using (2.26), we build a graph for the optimization problem, shown in Figure 2.2 (for clarity, not all arcs are shown). It can be immediately verified that the emphasized path is the shortest one using only three arcs in total (one for each class and one final arc for the literal class). The following tables summarize the final coding: Table 2.2 summarizes the class configuration and Table 2.3 gives the coding for each symbol in the compressed text.

Symbol   Count   Code
C        12      010
E        7       100
A        6       000
D        5       011
B        3       001
F        2       101
G        1       110
H        1       111

Table 2.1: Symbol Probabilities for the example

Class   |Ci|   Pi   Prefix   Index Len.   Members
C1      1      12   00       0            C
C2      2      13   1        1            E, A
C3      5      12   01       –            D, B, F, G, H

Table 2.2: Class Assignment for the example

Symbol   Code
C        00
E        10
A        11
D        01 011
B        01 001
F        01 101
G        01 110
H        01 111

Table 2.3: Final Symbol Coding

Figure 2.2: Class Structure Optimization Example (the selected path is emphasized)

Counting the bits in both the uncompressed and the compressed text reveals that the original representation requires 111 bits whereas the latter requires only 110 bits. Several points in this example should be noticed:

• The algorithm tends to gather symbols that have close probability values. The difference between p_1 and p_2 is big, but p_2 and p_3 are very close, so they were assigned to their own class. p_4, however, was not assigned to C_2 even though it is close to p_2 and p_3, since that would require adding p_4 to C_2 as well, and the price is too high.

• Equation (2.7) can be verified here. The compression attained is rather unimpressive (the compressed message is one bit shorter than the uncompressed message); this is due to the small number of classes.
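A quick sanity check of the bit counts quoted above, using the codes of Table 2.3 (the codebook is not counted here):

    text = "HDFCABACDFCCEEDCEBCBACDGDCCACACCEAEEE"
    code = {"C": "00", "E": "10", "A": "11",
            "D": "01011", "B": "01001", "F": "01101", "G": "01110", "H": "01111"}
    print(len(text) * 3)                    # 111 bits uncompressed (B = 3)
    print(sum(len(code[s]) for s in text))  # 110 bits class coded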

2.3.6 Implementation

The implementation starts with the definition of three arrays: f[k], which holds p_k; u[l], the length of the shortest path from the first node to the l-th node; and trace[l,j], the node through which the shortest path to node j in the l-th step passes.


We initialize u[l] with w_{1,l} according to (2.26), which is the shortest path from the first node to the l-th node using one step only. If there is no arc between v_1 and v_l, u[l] is given an "infinite" value (in practice MAXINT can be used). Next, we run the following loop:

for l = 2 to N+1
    for j = 2 to K+1
        u_new[j] ⇐ min( u[j], min over all arcs a_{i,j} of (u[i] + w_{i,j}) )
        trace[l,j] ⇐ the i for which this minimum is attained
    u ⇐ u_new

After completion, u[K+1] holds the value of the shortest path (which is the length of the compressed message without the prefixes). trace[l,j] lets us recover the nodes through which that path passed, and that gives us the class structure. The size of the trace array governs the space complexity of the algorithm, which is O(N · K).
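The sketch below implements the same search in Python; for clarity it carries whole paths instead of the trace array, trading memory for simplicity. w is a dict of arc weights such as the one built in the sketch of section 2.3.2 (again an illustration, not the thesis code):

    INF = float("inf")

    def optimal_classes(w, K, N):
        """Cheapest path from node 1 to node K+1 using at most N+1 arcs.
        Returns (cost, boundaries), where boundaries are the z_i of (2.23)-(2.25)."""
        best = {1: (0.0, (1,))}                   # node -> (cost, path so far)
        for _ in range(N + 1):                    # allow one more arc per round
            improved = dict(best)
            for (i, j), wij in w.items():
                if i in best:
                    cost, path = best[i]
                    if cost + wij < improved.get(j, (INF,))[0]:
                        improved[j] = (cost + wij, path + (j,))
            best = improved
        cost, path = best[K + 1]
        return cost, list(path[1:-1])             # interior nodes = class boundaries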

2.4 Experimental results

The algorithm described in the previous sections was implemented on a PC system and tested for compression performance. The texts used are taken from sources that are commonly used to evaluate compression, such as the Calgary and the Canterbury corpora. Comparative data is also available for some of these texts using previously introduced compressed matching algorithms (described in Chapter 1). Table 2.4 summarizes the text datasets.

Name         Description               Size     Reference
calgary      The Calgary corpus        3.1MB    [9]
world95      CIA World Factbook 95     2.9MB    [63]
canterbury   The Canterbury corpus     2.66MB   [10]

Table 2.4: Datasets for compression experiments

2.4.1 Implementation Issues

One of the most important parameters in an entropy coding system is the model used to extract the probability information for the symbols. An in-depth discussion of model choice for text compression is beyond the scope of this work and can be found, for example, in [6]. The model used here is a simple zero-order model, using character pairs as symbols. It is to be noted that more elaborate models can be used. The original CodePack, for example, divided each 32-bit word into two strictly alternating alphabets, coding each separately. When we deal with the string matching problem, such a simple model will be imperative.

Another issue is the transmission of the codebook. In order to improve compression, a new codebook is selected for every file, and a method must be established to transfer it. The method we used sends a small header for each class (containing its size, prefix etc.) and then a list of the symbols in that class. This method is optimized for a relatively small number of classes.


Thus, when we increase the number of classes, heading for the Huffman limit, the codebook becomes less efficient and the compression ratio deteriorates.

2.4.2 Results

The first variable we would like to examine is the number of classes. In section 2.2 we claimed that the compression ratio improves monotonically as the number of classes increases. Figure 2.4 shows a single text (world95.txt) compressed with a varying number of classes. The dashed line is the length of the text compressed using Huffman code, and the approach toward this limit is evident. Figure 2.3 shows compression results for several files selected from the datasets. The figure shows that when the number of classes is small, increasing it has a big impact on the result, whereas from some point the curve flattens and increasing the number of classes no longer yields a real improvement in compression. This behavior is consistent across all the texts shown.


Figure 2.3: Compression ratio for some files from the testset with different number of classes

Figure 2.4: Approaching the Huffman limit (X axis is logarithmic for convenience)

Chapter 3

Code Compression for Embedded Systems

3.1 System Parameters

When considering a code compression application, several design parameters have to be decided, concerning the compression rate, the decompression speed, the chip space requirements and the tradeoffs among them. The goal of this chapter is to explore the influence of some design parameters of the class compression system on these factors.

3.1.1 Probability Model

In order to apply an entropy encoding system, a proper probability model must be used. The probability model determines the probability that the next symbol (in the string to be compressed) is σ, for all σ ∈ Σ. Models are characterized by their order. A zero-order model assigns a constant probability to each symbol, regardless of the previous (or future) symbols. This is similar to a source emitting random symbols at some specified probabilities. An N-th order model assigns a probability to each symbol depending also on its context, which is the previous N symbols. Higher-order models have the potential to provide a better compression ratio, since they capture more of the behavior of the text. Higher-order models, however, are more complicated and require more time to build. In a code compression application, the decompression hardware is in the critical path from the memory to the execution units. Thus, it must perform its function very fast, which means that high-order models are not practical. Three models were considered for our experiments:

• Full instruction. In this model, every 32 bit instruction was considered as one atomic unit. The probability of each word was determined by counting. This model is designated as “Model 32” in the results.

• Half instruction. In this model, every symbol consists of 16 bits, which is half an instruction. There is no distinction between the lower and the higher halves. This model is designated as “Model 16” in the results.

• High/low separation. This is similar to the half instruction model, but the high and low parts are treated separately, each having its own probability set. This is the model used by CodePack. Compared to the other models, it requires approximately double the hardware. This model is designated as “Model 16/16” in the results. (A sketch of how the three models carve a code image into symbols is given below.)
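The following sketch illustrates the three symbol definitions on a raw code section; it is illustrative only (the byte order and the use of Python's struct module are assumptions, not details from the thesis):

    import struct
    from collections import Counter

    def symbol_counts(image: bytes, model: str):
        """Count symbol occurrences in a 32-bit-aligned code image."""
        words = struct.unpack(f"<{len(image) // 4}I", image)   # little-endian assumed
        if model == "32":                         # Model 32: whole instruction
            return Counter(words)
        halves = [(w >> 16, w & 0xFFFF) for w in words]
        if model == "16":                         # Model 16: both halves, one alphabet
            return Counter(h for pair in halves for h in pair)
        if model == "16/16":                      # Model 16/16: separate alphabets
            return (Counter(h for h, _ in halves), Counter(l for _, l in halves))
        raise ValueError(model)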


3.1.2 Codebook Limit

The codebook is required when decompressing the code at runtime. Every codeword (except literals) passes through it in order to be translated to the uncompressed word. The codebook must be available at all times and must have fast access, to minimize delays in the decompression process. To achieve this, it is implemented as fast RAM in the decompression core. RAM costs chip space relative to its size, and thus it is desirable to have some facility to limit the codebook size. In the optimization algorithm presented in the last chapter there was no such facility (the codebook size is determined along with the class structure to obtain maximum compression). Luckily, only a small modification is required. The codebook contains all the codes that are not in the literal class. If we look at the graph representation of the problem, we notice that the size of the literal class is determined by the node from which the (selected) final arc extends. If we want to limit the codebook size to K̃ symbols, we must make sure that the final arc extends from a node not after v_{K̃+1}. This is easily accomplished by removing all the arcs extending from nodes beyond v_{K̃+1}. Obviously, the class structure selected by the new graph may be suboptimal in comparison with the unlimited graph, but it is reasonable to expect some degradation in compression performance after imposing a new constraint.
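In code, the constraint amounts to pruning the arc set before running the shortest-path search; one way to realize it, under the same assumptions as the earlier graph sketch, is:

    def limit_codebook(arcs, K_tilde):
        """Keep only arcs whose source node is at most K_tilde + 1, so the
        literal class must start no later than symbol K_tilde + 1 and the
        codebook holds at most K_tilde entries."""
        return {(i, j): w for (i, j), w in arcs.items() if i <= K_tilde + 1}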

3.1.3 Static vs. Dynamic Structure

In order to achieve maximum compression, a new class structure has to be calculated for every code image. Such a method requires the hardware to be capable of adjusting itself to new class structures at runtime (a possible implementation of such hardware can be found in [59]). On the other hand, a static structure, precomputed on some typical code, has less potential to achieve good compression, but can be implemented in smaller and faster fixed hardware. CodePack takes the latter approach; its class structure is shown in Figure 3.1. In order to determine the class structure for a static configuration, some method must be applied. We have tried three methods (in the descriptions below, the term characteristic set is a set of programs that are typical of the subject architecture):

• Averaging. In this method, we calculate the optimal class configuration for every program on the characteristic set and then average the length of all the classes (i.e. all first classes, all second classes etc.). The result is then rounded to the nearest integral power of two.

• Most Frequent. Again, we calculate the optimal structure for all the programs, and for every class we select the most frequently occurring length.

• Set-optimal. In this method, every optimal class structure is used to compress all the other programs in the characteristic set, and the structure that gives the best result on average is selected.


Figure 3.1: Class Structure of CodePack


Figure 3.2: Comparison of Probability Models for Code Compression

3.2 Experimental Results

Our experiments were performed on a set of programs from the SPEC2000 [58] suite. The programs were compiled for the Alpha architecture, and the code sections were extracted from the object files using the GNU binary utilities. Figure 3.2 shows the influence of the probability model on compression. The parameters used were N = 6 (similar to that of CodePack) with no codebook limit. The three models that have been discussed are shown, along with the compression rate attained by the ZIP compression utility. The latter is general-purpose compression software based on the Lempel-Ziv algorithm, and it is given as a reference since it generally attains good results (ZIP files, however, contain some management overhead). The results show that the 16/16 model provides the best results. Another conclusion is that the zero-order approximation is reasonable for this kind of application (since the ZIP results are not much better).


Figure 3.3: Comparison of Dictionary Sizes for Different Models

Figure 3.3 compares the dictionary sizes for these models. The 32-bit model requires a great amount of dictionary space, which is a major disadvantage. Figures 3.4 and 3.5 show compression ratio results for various codebook limit values. The relation between the codebook limit and the compression performance is immediate – as the codebook limit gets tighter, less compression is attained. It is to be noted that when the codebook size is not limited, it does not grow to include all the symbols, since from some point on, adding symbols to the dictionary only increases the size of the overall compressed text. The optimization algorithm determines the optimal codebook size as part of calculating the optimal class structure.


[Plot: Compression Ratio [%] versus Number of Classes, with curves for D=128, D=1024, D=4096 and UNLIMITED]

Figure 3.4: The Effect of Codebook Limit (with varying N )

[Plot: Compression Ratio [%] versus Dictionary Size [words]]

Figure 3.5: The Effect of Codebook Limit (fixed N )

16384



Figure 3.6: Static Class Configuration

The last figure, 3.6, shows compression results for a static class configuration. The class configuration was determined using the methods described in the previous section and then applied to some files from the set (the characteristic set was all the programs from SPEC2000). The results show that using a static class structure carries very little penalty, and thus it is a feasible design option.


Chapter 4

Multi-resolution string matching

4.1 Low resolution text

Let

\[ T = t_1 t_2 t_3 \ldots t_\ell \tag{4.1} \]

be a string (text) over Σ (as defined in equation (2.3)). The low resolution image of T is defined as

\[ \hat{T}(T) = \hat{t}_1 \hat{t}_2 \ldots \hat{t}_\ell \tag{4.2} \]

over the alphabet

\[ \hat{\Sigma} = \{ \hat{\sigma}_1, \hat{\sigma}_2, \ldots, \hat{\sigma}_N, \hat{\sigma}_{N+1} \} \tag{4.3} \]

where

\[ \hat{t}_i = \hat{\sigma}_j, \quad t_i \in C_j \tag{4.4} \]

C_j ∈ C is a class, as defined in chapter 2. Following this definition, T̂ has three important properties:

• Uniqueness. For any text T, there is one and only one low resolution image T̂. This property is a direct corollary of the uniqueness property of the class definition (2.2). The reverse does not hold: for any low resolution image, there exist many texts that would produce it.

• Existence. A low resolution image exists for any text. This property is a direct corollary of the cover property (2.1).

• Information content. The information content of T̂ is lower than the information content of the text that was used to produce it, although the number of symbols is identical in both. This is due to the fact that |Σ̂| < |Σ|. This means that fewer bits are required to represent the low resolution image of the text than the actual text.

In order to recover a text from its low resolution image, more information needs to be supplied. This information will be called resolving information. A text represented as a low resolution image and resolving information will be called a multi-resolution text coding. It should be noted that the process can be repeated several times for the text, which yields a coding with several levels of resolution. From this point on, we will restrict ourselves to two-level coding only.
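As a minimal sketch of the transformation itself (the class partition used below is hypothetical and serves only to illustrate the mapping; it is not the structure produced by the optimization algorithm):

    def low_resolution_image(text, classes):
        # classes is a list of symbol sets that cover the alphabet; by the
        # cover and uniqueness properties every symbol belongs to exactly
        # one class, so the mapping below is well defined.
        class_of = {sym: j for j, cls in enumerate(classes, start=1) for sym in cls}
        return [class_of[sym] for sym in text]

    # Hypothetical three-class partition of a small alphabet.
    classes = [set("AB"), set("CDEF"), set("GHIJKLMNOPQRSTUVWXYZ ")]
    print(low_resolution_image("ABBA CADABRA", classes))
    # [1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 3, 1]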


Figure 4.1: Multiple resolution text using class coding

The class coding, presented in the previous chapter, is one possible way to code a text in a multi-resolution fashion. When encoding a text, class coding produces a stream of prefix-index pairs. Taking the prefixes only, we get the desired low-resolution image. In order to decode the text completely, we need the indices, which are used as resolving information. Figure 4.1 illustrates the process using the text from section 2.3.5.

4.2 A string matching algorithm

4.2.1 Definition

Let

\[ S = s_1 s_2 s_3 \ldots s_r, \qquad s_j \in \Sigma \tag{4.5} \]


be a string. We wish to find the first (all¹) occurrences of S in T. T, however, is given in class compressed form, ordered as two separate blocks: first comes the prefix block, then the index block, as illustrated in figure 4.2.

Figure 4.2: Class-compressed text stream format

The simplest way to accomplish the task is to extract the compressed file into some temporary storage and then to apply known algorithms. As discussed in the first chapter, this method has several drawbacks. The following algorithm is able to search the compressed file without decompressing it first (a sketch in code is given after the list):

• Step 1. Using equation (4.4) and the class transformation G used to build T, construct a low resolution image of S, Ŝ(S).

• Step 2. Find all the occurrences of Ŝ in T̂ (where T̂ is the prefix block of the compressed text T).

• Step 3. For any occurrence found, decode T locally and check whether S actually occurs.

¹ The two problems are identical. We will refer only to finding all the occurrences.
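A high-level sketch of these three steps is given below. For readability the prefix block is represented as an already-decoded sequence of class numbers, and decode_window(pos, length), which locally decodes length symbols of the original text starting at symbol position pos, is a hypothetical helper standing in for the mechanism described in the next section:

    def compressed_search(pattern, prefix_block, classes, decode_window):
        # Step 1: build the low resolution image of the pattern.
        class_of = {sym: j for j, cls in enumerate(classes, start=1) for sym in cls}
        low_res_pattern = [class_of[s] for s in pattern]
        r = len(low_res_pattern)

        matches = []
        for pos in range(len(prefix_block) - r + 1):
            # Step 2: candidate occurrence in the low resolution text.
            if prefix_block[pos:pos + r] == low_res_pattern:
                # Step 3: locally decode the text and verify the candidate.
                if decode_window(pos, r) == pattern:
                    matches.append(pos)
        return matches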

4.2.2 Details

In this section, we will focus on a detailed description of the algorithm, its feasibility and its correctness. A discussion of performance and complexity will be given in the following sections. In the first step of the algorithm, a low resolution image of S has to be built. This is done by coding S using the same class structure used to encode T and discarding the indices. This class structure must be known, either by including it in the compressed stream or by any other means (such as a static class structure). The resultant string (over Σ̂) is unique, exactly as the compressed form of T is unique for T. In the second step, Ŝ is searched for in T̂. Since the prefixes are Huffman coded, this step involves a search within a variable-length code. This means that the usual methods for accelerating searches (such as Boyer-Moore [7] or Knuth-Morris-Pratt [31]) cannot be used. Moreover, each prefix also carries information about the length of the index in its prefix-index pair. As we shall see later, it is necessary to decode the index lengths from the beginning of the prefix block up to each possible occurrence of Ŝ. Therefore, the search strategy must be to serially decode all the prefixes. It should be noted that decoding all the prefixes of T (its low resolution image) is significantly less costly than decoding the complete text, since the number


of possible prefixes is much smaller than the number of symbols, and their Huffman codes are accordingly shorter. The result of this step is a list of matches. This list, however, is valid only for the low resolution string. Since a single low resolution string can be produced by many full resolution strings, this list may contain false matches. The uniqueness property, on the other hand, asserts that if S appears in the text, Ŝ must appear in T̂. Thus, another step is required to determine which of the occurrences of Ŝ is a true occurrence of S. To do that, we need the resolving information stored as indices. One possible way is to decode the indices, starting at the first symbol, up to the occurrence found. This, however, means completely decoding T, which is not desirable. Rather, it is possible to access the index block "randomly": start decoding at the occurrence and decode only as many symbols as required. The indices, like the prefixes, are stored as variable-length codes. The length of each index is, however, known at an earlier stage, namely when its prefix was decoded. To access the i'th index, we only need to decode the prefixes of t_1 ... t_i and accumulate the lengths of the indices paired with each prefix. That way, the exact location of the index of t_i within the index block is known without having to decode all the indices along the way. After the indices have been checked, and the occurrence of the string being sought has been confirmed, it can be marked as an occurrence. The process is illustrated in figure 4.3. In part (a) of the figure, the text is shown (the text, the classes and the symbol codes are taken from the example in section 2.3.5) and its low resolution image is given just below it. The numbers 1, 2 and 3 denote that the symbol belongs to C1, C2 or C3, respectively.


This information is stored as an integral part of the compressed form of T, and it need not be calculated before each search. Part (b) shows the string sought, S, and its low resolution image. In contrast to the former, this image needs to be calculated before each search (whenever S changes). The class structure used is exactly the one that was used to encode T, which means that the optimization process need not be repeated. The next step is to search for all the occurrences of Ŝ in T̂. These occurrences are marked in part (a). Ŝ appears 3 times in T̂, of which only one is a true match. The process of resolving the occurrences is shown in part (c). Decoding the prefix block yields not only the prefixes but also the length of the index associated with each prefix. This length is shown over every prefix. The first Ŝ occurs after 3 symbols, and accumulating the index lengths of the first three symbols provides a pointer into the index block. From this point, three symbols (the length of S) have to be decoded to determine that this is a true occurrence. The same procedure continues for the next occurrences of Ŝ, but resolving the original text in these cases reveals that they are false matches.
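The pointer computation described above can be sketched as follows (the names are ours; index_length_of_class[c] stands for the index length, in bits, implied by a prefix of class c):

    def index_offsets(prefix_classes, index_length_of_class):
        # While decoding the prefixes serially, accumulate the index lengths so
        # that offsets[i] is the bit position, inside the index block, where the
        # index of the i-th symbol starts.  No index needs to be decoded here.
        offsets = []
        position = 0
        for c in prefix_classes:
            offsets.append(position)
            position += index_length_of_class[c]
        return offsets

To verify a candidate occurrence starting at symbol i, decoding of the index block starts at offsets[i] and proceeds for the length of the pattern only.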

4.2.3 Double-byte symbols

In order to improve the compression performance, every symbol, as presented in the previous chapter, comprises two characters. This poses a problem for the search algorithm, since the basic symbol of the string being searched is only one character long. This may cause ambiguity, since every single string can be encoded in two different ways.


Figure 4.3: Illustration of string matching procedure


Consider, for example, the following text:

TODAY IS THE DAY

It is encoded as double-byte symbols in the following way (␣ denotes a space character):

TO  DA  Y␣  IS  ␣T  HE  ␣D  AY
σ1  σ2  σ3  σ4  σ5  σ6  σ7  σ8

We wish to search for the string DAY within it. The naïve approach will encode this string as σ2σ3. There might actually be several possibilities for the last symbol, so it can simply be dropped (remembering to match the last character at some later stage). This choice, however, will miss the second occurrence of the string. The solution is to perform two searches: the first is done using the conversion just described; the second starts matching the target string from its second character. In the last example, the first search would try to find the sequence σ2σ3 and the second would try to find the sequence σ8. Combining the results of both searches yields the desired result.
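A sketch of how the two pattern encodings can be derived from a single-character pattern is shown below; the pairing mirrors the example above, and the leftover characters are the ones that must be matched against the locally decoded text at verification time (the function and its names are illustrative only):

    def double_byte_encodings(pattern):
        def pairs(s):
            full = [s[i:i + 2] for i in range(0, len(s) - 1, 2)]
            leftover = s[-1] if len(s) % 2 else None
            return full, leftover

        # First search: the pattern starts on an even position of the text.
        even_pairs, even_tail = pairs(pattern)
        # Second search: the pattern starts on an odd position, so its first
        # character is left over and matched at verification time.
        odd_pairs, odd_tail = pairs(pattern[1:])
        odd_head = pattern[0]
        return (even_pairs, even_tail), (odd_head, odd_pairs, odd_tail)

    print(double_byte_encodings("DAY"))
    # ((['DA'], 'Y'), ('D', ['AY'], None))

In terms of the example, the character pairs would then be translated to the corresponding symbols (DA to σ2, AY to σ8) before the low resolution search.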

4.3 Analysis

The common tool for estimating the performance of algorithms is asymptotic complexity. Reviewing the algorithm described above, we see that in the first step the prefix block, which is ℓ symbols long, has to be searched for all the occurrences of a single string. The complexity of such an algorithm is O(ℓ + r) in the worst case (assuming the use of a slightly more intelligent algorithm


than brute force). In the second step, in the worst case (all the characters are decoded), another brute force search is performed, so the overall complexity is O(ℓ + r). This places the algorithm in the efficient category of the taxonomy of Amir and Benson [1]. From the complexity point of view, the algorithm is efficient (since it runs in polynomial time) but it is no faster than a brute force search in the uncompressed text. This is, however, not the complete picture. The prefix decoding stage consists of a sequential pass over the compressed file, whereas the index decoding stage is based on random access. In most cases (when the file is stored on a device such as a disk, CD-ROM etc.), random access is much slower. Therefore, the real performance measure is the rate of false matches, i.e. matches in the low resolution part that are not matches in the text. This phenomenon cannot be observed with the asymptotic complexity tool, since it ignores constants. In order to estimate the rate of false matches, a program was designed to implement the algorithm and measure this value (along with the search time). The test methodology follows the one presented in [29]. For each tested file (a subset of the files in table 2.4), random strings were selected with lengths from 4 to 12 symbols, 10 of each length. Although [29] selected even longer strings, doing so would not have much meaning here because a) the false match rate falls sharply as the searched string gets longer, and b) such long strings are uncommon. The results of the experiments are given below. Finally, some actual search time measurements are made.

4.3.1 Relation of number of classes to search speed

Increasing the number of classes has two effects. The first is increasing the number of symbols in the low resolution text. As a consequence, the low-resolution part becomes closer to the full text and the false match rate is expected to drop. On the other hand, the increasing number of symbols in the prefix block requires more bits to represent, and the size of the prefix block increases. An increase in the prefix block size will increase the execution time of the first stage. The conclusion is that there is expected to be a number of classes that is optimal with respect to search time. The location of the optimum depends heavily on the access time of the device where the compressed text is stored. It is therefore also highly dependent on speedup techniques commonly employed, such as caching and buffering. A possible search system configuration is one where the prefix block is stored on a fast but space-limited device (local disk or memory) and the index block is stored on a larger, but slower, device. In such a system, the size of the prefix block is of greater importance, and N can be controlled to achieve a desired size.
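This trade-off can be made concrete with a toy cost model (the model and its parameters are our own illustrative assumptions, not quantities measured in the experiments):

    def expected_search_time(prefix_block_bits, false_matches, pattern_len,
                             seq_bits_per_sec, random_access_sec, decode_symbol_sec):
        # One sequential scan of the prefix block, plus one random access into
        # the index block and a short local decode for every candidate match.
        scan_time = prefix_block_bits / seq_bits_per_sec
        verify_time = false_matches * (random_access_sec + pattern_len * decode_symbol_sec)
        return scan_time + verify_time

Increasing N enlarges prefix_block_bits but reduces false_matches, so the minimum of this expression over N corresponds to the optimal number of classes discussed above.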

4.4 Extra space

While running, the search program needs no extra space that depends on the length of the uncompressed file. The only memory required is used to accelerate the decoding of the prefixes' Huffman codes and the search through them.


4.5 Experimental Results

4.5.1 Implementation Notes

Several points in the implementation of the algorithm are worth noting:

• Prefix Decoding. The method used to decode the prefix block is byte oriented. Each prefix is aligned to a byte boundary and then fed into a lookup table (containing 256 entries) which decodes the prefix at the beginning of the byte and outputs the class and the prefix's actual length. The length is then used to determine the location of the next prefix, and the class is passed on to the low-resolution string matching algorithm. The table has constant size, and it is built in constant time. The only drawback is that the length of the longest prefix is limited to 8 bits (we can, of course, use a larger table). In practice, however, this limit is sufficient. A different way to overcome this limitation is to allocate the prefix codes using a length-limited algorithm (for instance [35] or [57]), which comes at the expense of compression. A sketch of such a table-driven decoder is given after this list.

• String Matching Algorithm. The string matching algorithm in question is the one used to find the low resolution image of S in the low resolution section of T. As the prefixes are decoded one by one and serially, no benefit can be gained from character-skipping algorithms such as Boyer-Moore. On the other hand, using the simple brute force algorithm can lead to unnecessarily long search times. The solution is an algorithm that matches the target pattern within a text using a deterministic finite automaton (DFA). This is a two stage process. In the first step


the automaton is built (the algorithm is described in [48] and has a reference implementation in [11]); then every new symbol can be fed to it, and if it enters the final state, a match has been found.

• Measuring Execution Time. In order to measure time correctly and accurately, all the programs were run under the DOS operating system. To prevent inaccuracies due to I/O access, the tested files were copied in their entirety to memory and the processing time was measured for execution on that copy. The timing results are treated as simulation times, and the time measurement units were deliberately left unspecified. The results are for comparison purposes only and should not be interpreted as absolute times.
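The byte-oriented prefix decoder mentioned in the first item can be sketched as follows (a minimal model that assumes, as above, that no prefix codeword is longer than 8 bits; the class codes used in the example are hypothetical, and the bit handling is simplified compared to the real implementation):

    def build_decode_table(code_of_class):
        # code_of_class maps a class number to its Huffman codeword, given as a
        # bit string, e.g. {1: '0', 2: '10', 3: '11'}.  Entry b of the table
        # holds (class, codeword length) for the codeword at the top of byte b.
        table = [None] * 256
        for cls, code in code_of_class.items():
            pad = 8 - len(code)
            base = int(code, 2) << pad
            for filler in range(1 << pad):        # every byte sharing this prefix
                table[base | filler] = (cls, len(code))
        return table

    def decode_prefixes(bits, table):
        # bits is the prefix block given as a string of '0'/'1' characters.
        classes, pos = [], 0
        while pos < len(bits):
            window = bits[pos:pos + 8].ljust(8, '0')   # align the next prefix to a byte
            cls, length = table[int(window, 2)]
            classes.append(cls)
            pos += length
        return classes

    table = build_decode_table({1: '0', 2: '10', 3: '11'})
    print(decode_prefixes('01001111', table))   # [1, 2, 1, 3, 3]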

4.5.2 Results

Figure 4.4 shows the compression results for some of the text files used in the experiments, compared with ZIP. Unlike in the code compression application, natural language has more built-in redundancy [14], which the 0-order model cannot exploit; thus ZIP (which does exploit it) gives much better results. Figures 4.5, 4.6 and 4.7 show the results obtained from the search runs described in the previous sections. The graph in figure 4.5 shows the rate of false matches as a function of the number of classes (N) and the length of the searched pattern. The Y axis gives the percentage of all the substrings of T̂ that are false matches. It is evident that the false match rate drops with increasing pattern length and number of classes. The drop due to the number of classes is attributed to the fact that


as the number of possible symbols (in the prefix) rises, the probability of finding an arbitrary string of symbols (that is, one that does not result from a true match) drops. The dependency of search speed on the pattern length exists in all search algorithms, but in this algorithm it is much more pronounced. Again, this is due to the fact that the probability of finding some arbitrary string of symbols drops sharply as the length increases. Eventually, the only prefix string that will match the pattern being searched is the true match, and the number of false matches drops to zero. The graph in figure 4.6 shows the search time as a function of the same variables. The resemblance between this graph and the previous one is salient. This fact supports the claim that the search time depends mainly on the false match rate. The last graph of the three, figure 4.7, shows the relation between the search time and the rate of false matches. Again, the relation is clear. The graph in figure 4.8 shows the search time as a function of the number of classes. This graph was obtained without operating system caching. The expected behavior of the search time, as previously discussed, is evident from the graph. Under these conditions, the optimal number of classes is about 10. The file that was searched is world95 and the searched string was 6 characters long. Finally, figure 4.9 shows a comparison between our algorithm and the lzgrep program [45]. The graph shows that in some cases our algorithm slightly outperforms lzgrep, even though the latter uses the fast Boyer-Moore method. This comes, however, at the expense of compression (the LZ-compressed file is about 20 percent smaller).


Figure 4.4: Class coding compared with other compression algorithms

Figure 4.5: False matches results (false match rate versus N, for pattern lengths r = 4, 6, 10 and 12)

Figure 4.6: Search time results (search time versus N, for pattern lengths r = 4, 6, 10 and 12)

Figure 4.7: Search time versus false matches (search time versus false match ratio, for N = 4, 6 and 10)

Figure 4.8: Optimal selection of N

Figure 4.9: Comparison of Search Times

4.6 Enhancement Possibilities

In this section, several possible enhancements to the basic algorithm are presented. They were not implemented, tested or studied, but they are examples of how this approach can be generalized.

• Non-text files. Throughout this work, only text files were mentioned and considered. However, no feature of the method relies on this fact. Unlike other methods in this field of research ([42] for instance), the method presented here can be used without any changes for arbitrary file types, including text in various encoding schemes and binary files.

• Approximate matching. Approximate pattern matching is a generalization of the (exact) pattern matching problem. In this problem, we try to find all the occurrences of some pattern within a text, but we allow some differences (errors) in the matches. The differences are usually expressed by some distance function, and some maximum distance is allowed. When looking only at the first stage of our algorithm, i.e. the search in the low resolution part, we get exactly this kind of algorithm. The number of classes can be used to control the allowed distance. Moreover, a variation of this problem is the approximate counting problem, in which we are interested in counting the number of occurrences of the pattern. Again, by counting the number of occurrences in the low resolution part, such a result is naturally obtained.

• Regular expressions. Regular expressions are expressions describing sets of strings, and they are commonly searched for in texts. The usual


way of handling these expressions is to compile them into a deterministic finite automaton, into which the symbols of the text are fed. This approach can also be applied to the low resolution part of the text (given the appropriate translation), and then, in the same way as described here, the resolving information can be checked to find the desired results.


Chapter 5

Summary

In this work we have dealt with several problems. The first problem was class coding and its optimization. The method was introduced and an algorithm for optimizing the class structure was developed. This algorithm is efficient and is proven to run faster than the brute force approach. Though the algorithm handles only the index part of the code, a heuristic argument was given suggesting that this method gives good results. The second problem is the problem of compressed code execution. The benefits of class coding for such an application were detailed, followed by experimental results demonstrating the influence of several system parameters on the final results. The improvement here, over the CodePack system (which uses class coding), is the ability to control the parameters easily and to adapt this method to processors other than the PowerPC. The third and last problem is the problem of compressed matching. This problem, which has been gaining popularity recently, was defined, and a novel algorithm based on class compression was suggested. The algorithm is


based on a decomposition of the text into two parts: a low resolution part and resolving information. Unlike in the image processing area (from which the term "low resolution image" is taken), the low resolution part has little meaning by itself (the original text cannot be recovered from this part alone), but it enables fast filtering of the text, so that the actual search can be performed quickly. The chapter concluded with experimental results showing search times.

Bibliography [1] A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proc. Second IEEE Data Compression Conference, pages 279–288, 1992.

[2] A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern matching in z-compressed files. Journal of Computer and System Sciences, 52:299–307, 1996.

[3] A. Amir, G.M. Landau, and U. Vishkin. Efficient pattern matching with scaling. In Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms, pages 344–357, 1990.

[4] A. Orpaz and S. Weiss. A study of CodePack: Optimizing embedded code space. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pages 103–108, 2002.

[5] A. Orpaz and S. Weiss. Pattern matching by means of multi-resolution compression. In Proceedings DCC 2003 Data Compression Conference, page 441, 2003.

[6] T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, New Jersey, 1990.

[7] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Communications of the ACM, 20, 1977.

[8] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corp, 1994.

[9] The Calgary Corpus. http://links.uwaterloo.ca/calgary.corpus.html.

[10] The Canterbury Corpus. http://corpus.canterbury.ac.nz.

[11] C. Charras and T. Lecroq. Exact String Matching Algorithms. http://www-igm.univ-mlv.fr/~lecroq/string/index.html.

[12] I.C. Chen. Enhancing Instruction Fetching Mechanism Using Data Compression. PhD thesis, University of Michigan, 1997.

[13] I.C. Chen, P. Bird, and T. Mudge. The impact of instruction compression on I-cache performance. Technical Report CSE-TR-330-97, EECS Department, University of Michigan, 1996.

[14] M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi. Text compression using antidictionaries. Lecture Notes in Computer Science, 1644:261–270, 1999.

[15] T. Eilam-Tzoreff and U. Vishkin. Matching patterns in strings subject to multi-linear transformations. Theoretical Computer Science, 60(3):231–254, 1988.

[16] J. Ernst, C.W. Fraser, W. Evans, S. Lucco, and T.A. Proebsting. Code compression. In Proc. Conf. on Programming Languages Design and Implementation, pages 358–365, June 1997.

[17] M. Farach and M. Thorup. String matching in Lempel-Ziv compressed strings. In Proc. of the twenty-seventh annual ACM Symposium on Theory of Computing (STOC), pages 703–712, 1995.

[18] M. Franz and T. Kistler. Slim binaries. Communications of the ACM, 40(12):87–94, 1997.

[19] C.W. Fraser and T.A. Proebsting. Finite-state code generation. In Proc. Conf. on Programming Languages Design and Implementation, pages 270–280, May 1999.

[20] M. Game and A. Booker. CodePack: Code Compression for PowerPC Processors. International Business Machines (IBM) Corporation, 1998.

[21] L. Gąsieniec and W. Rytter. Almost-optimal fully LZW-compressed pattern matching. In Proceedings DCC'99 Data Compression Conference, 1999.

[22] J. Hoogerbrugge, L. Augusteijn, J. Trum, and R. van de Wiel. A code compression system based on pipelined interpreters. Software - Practice and Experience, 29(11):1005–1023, 1999.

[23] D.A. Huffman. A method for the construction of minimum redundancy codes. Proc. IRE, 40(9):1098–1101, September 1952.

[24] IBM. CodePack: PowerPC Code Compression Utility User's Manual. Version 3.0. International Business Machines (IBM) Corporation, 1998.

[25] T.M. Kemp, R.M. Montoye, J.D. Harper, J.D. Palmer, and D.J. Auerbach. A decompression core for PowerPC. IBM Journal of Research and Development, 42(6):807–812, Nov 1998.

[26] T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Shift-and approach to pattern matching in LZW compressed text. In Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching, pages 1–13. Springer-Verlag, 1999.

[27] T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa. Multiple pattern matching in LZW compressed text. In Data Compression Conference, pages 103–112, 1998.

[28] K. Kissell. MIPS16: High-density MIPS for the Embedded Market. Silicon Graphics MIPS Group, 1997.

[29] S. Klein and D. Shapira. A new compression method for compressed matching. In IEEE Data Compression Conference (DCC), pages 400–409, Snowbird, Utah, March 2000.

[30] S. Klein and D. Shapira. Pattern matching in Huffman encoded texts. In Proceedings of the Data Compression Conference, 2001.

[31] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. of Computing, 6:323–350, 1977.

[32] M. Kozuch and A. Wolfe. Compression of embedded system programs. In Proc. Int'l Conf. on Computer Design, 1994.

[33] M. Kozuch and A. Wolfe. Performance analysis of the compressed code RISC processor. Technical Report CE-A95-2, Princeton University Computer Engineering, 1995.

[34] S.Y. Larin and T.M. Conte. Compiler-driven cached code compression schemes for embedded ILP processors. In Proc. Int'l Symp. on Microarchitecture, pages 82–92, November 1999.

[35] L. Larmore and D.S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the ACM, 37(3):464–473, Jul 1990.

[36] E.L. Lawler. Combinatorial Optimization. Holt, Rinehart and Winston, July 1976.

[37] A.C. Lear. Shedding light on embedded systems. IEEE Software, 16(1):122–125, January/February 1999.

[38] E.A. Lee. What's ahead for embedded software? IEEE Computer, 33(9):18–26, September 2000.

[39] C. Lefurgy, P. Bird, I.C. Chen, and T. Mudge. Improving code density using compression techniques. In Proc. Int'l Symp. on Microarchitecture, pages 194–203, December 1997.

[40] H. Lekatsas, J. Henkel, and W. Wolf. Code compression for low power embedded system design. In Proceedings of the 37th Conference on Design Automation, pages 294–299, 2000.

[41] S. Liao, S. Devadas, and K. Keutzer. A text compression based method for code size minimization in embedded systems. ACM Transactions on Design Automation of Electronic Systems, 4(1):12–38, January 1999.

[42] U. Manber. A text compression scheme that allows fast searching directly in the compressed file. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, pages 113–124. Springer-Verlag, Berlin, 1994.

[43] S. Mitarai, M. Hirao, T. Matsumoto, A. Shinohara, M. Takeda, and S. Arikawa. Compressed pattern matching for SEQUITUR. In Data Compression Conference, pages 469–, 2001.

[44] E.S. De Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, April 2000.

[45] G. Navarro and J. Tarhio. lzgrep – a direct compressed text search tool. www.dcc.uchile.cl/gnavarro/software.

[46] J. Rautio, J. Tanninen, and J. Tarhio. String matching with stopper encoding and code splitting. In Combinatorial Pattern Matching, 13th Annual Symposium, CPM 2002, Fukuoka, Japan, July 3-5, 2002, Proceedings. Springer, July 2002.

[47] R.L. Rivest. On the worst-case behaviour of string-searching algorithms. SIAM J. of Computing, 6(4):669–674, December 1977.


[48] G. Rozenberg and A. Salomaa. Handbook of formal languages, vol. 1: word, language, grammar. Springer-Verlag New York, Inc., 1997.

[49] A. Said and W.A. Pearlman. Low-complexity waveform coding via alphabet and sample-set partitioning. In Visual Communications and Image Processing ’97, Proc. SPIE Vol. 3024, pages 25–37, Feb. 1997.

[50] B. Santo. Embedded battle royale. IEEE Spectrum, 38(12):36–41, December 2001.

[51] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa. Byte pair encoding: a text compression scheme that accelerates pattern matching. Technical Report DOI-TR-CS-161, Department of Informatics, Kyushu University, April 1999.

[52] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for pattern matching. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching. Springer, July 1999.

[53] Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Pattern matching in text compressed by using antidictionaries. Lecture Notes in Computer Science, 1645:37–49, 1999.

[54] Graham A. Stephen. String Searching Algorithms. World Scientific, 1994.

[55] M. Takeda. Pattern matching machine for text compressed using finite state model. Technical report, Department of Informatics, Kyushu University, October 1997.

[56] J.L. Turley. Thumb squeezes ARM code size. Microprocessor Report, 9(4), March 1995.

[57] D.C. Van Voorhis. Constructing codes with bounded codeword lengths. IEEE Transactions on Information Theory, 20(3):288–290, March 1974.

[58] C.T. Weaver. Spec 2000 Binaries. www.eecs.umich.edu/~chriswea/benchmarks/spec2000.html.

[59] S. Weiss and S. Beren. Class-based decompressor design for compressed instruction memory in embedded processors. IEEE Transactions on Computers, 52(11):1495–1500, Nov 2002.

[60] T.A. Welch. A technique for high-performance data compression. IEEE Computer, 17(6):8–19, June 1984.

[61] W. Szpankowski. Asymptotic average redundancy of Huffman (and other) block codes. IEEE Transactions on Information Theory, 46(7):2434–2443, Dec 2000.

[62] A. Wolfe and A. Chanin. Executing compressed programs on an embedded RISC architecture. In Proc. Int'l Symp. on Microarchitecture, pages 81–91, 1992.

[63] The 1995 CIA World Factbook. http://www.ibiblio.org/gutenberg/etext96/world95.txt.

[64] Y. Xie, W. Wolf, and H. Lekatsas. A code decompression architecture for VLIW processors. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 66–75, 2001.

[65] C. Ye and R.W. Yeung. A simple upper bound on the redundancy of Huffman codes. IEEE Transactions on Information Theory, 48:2132–2138, July 2002.
