Recursive Hashing and One-Pass, One-Hash n-Gram Count Estimation
Daniel Lemire, Université du Québec à Montréal
and Owen Kaser, University of New Brunswick
Many applications use sequences of n consecutive symbols (n-grams). We review n-gram hashing and prove that recursive hash families are pairwise independent at best. We prove that hashing by irreducible polynomials is pairwise independent whereas hashing by cyclic polynomials is quasi-pairwise independent: we make it pairwise independent by discarding n − 1 bits. One application of hashing is to estimate the number of distinct n-grams, a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire a statistically unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass, one-hash algorithm is sufficient for accurate estimates if the hashing is sufficiently independent. For example, we can improve by a factor of 2 the theoretical bounds on estimation accuracy by replacing pairwise independent hashing with 4-wise independent hashing. We show that recursive random hashing is sufficiently independent in practice. Perhaps surprisingly, our experiments showed that hashing by cyclic polynomials, which is only quasi-pairwise independent, sometimes outperformed 10-wise independent hashing while being twice as fast. For comparison, we measured the time to obtain exact n-gram counts using suffix arrays and show that, while we used hardly any storage, we were an order of magnitude faster. The experiments used a large collection of English text from Project Gutenberg as well as synthetic data.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Content Analysis and Indexing; H.2.7 [Database Administration]: Data warehouse and repository

General Terms: Algorithms, Theory, Experimentation

Additional Key Words and Phrases: recursive hashing, view-size estimation, n-grams
1. INTRODUCTION
Consider a sequence of symbols a_i ∈ Σ of length N. The data source has high latency: for example, it is not in a flat binary format or a DBMS, making random access and skipping impractical. The symbols need not be characters from a natural language: they can be particular "events" inferred from a sensor or a news feed, they can be financial or biomedical patterns found in time series, they can be words in a natural language, and so on.

The first author was supported by NSERC grant 261437 and the second author was supported by NSERC grant 155967. Authors' addresses: Daniel Lemire, Université du Québec à Montréal, 100 Sherbrooke West, Montréal, QC H2X 3P2 Canada and Owen Kaser, University of New Brunswick, 100 Tucker Park Road, Saint John, NB E2L 4L1 Canada.
The number of distinct symbols (|Σ|) could be large (on the order of 10^5 in the case of words in a typical English dictionary), but it is assumed to be small compared to the amount of memory available. We make no other assumption about the distribution of these distinct symbols and n-grams.

An n-gram is a consecutive sequence of n symbols. We use n-grams in language modeling [Gao and Zhang 2001], pattern recognition [Yannakoudakis et al. 1990], predicting web page accesses [Deshpande and Karypis 2004], information retrieval [Nie et al. 2000], text categorization and author attribution [Caropreso et al. 2001; Joula et al. 2006; Keselj and Cercone 2004], speech recognition [Jelinek 1998], multimedia [Paulus and Klapuri 2003], music retrieval [Doraisamy and Rüger 2003], text mining [Losiewicz et al. 2000], information theory [Shannon 1948], software fault diagnosis [Bose and Srinivasan 2005], data compression [Bacon and Houde 1984], data mining [Su et al. 2000], indexing [Kim et al. 2005], On-line Analytical Processing (OLAP) [Keith et al. 2005a], optical character recognition (OCR) [Droettboom 2003], automated translation [Lin and Hovy 2003], time series segmentation [Cohen et al. 2002], and so on. This paper concerns the use of previously published hash functions for n-grams, together with recent randomized algorithms for estimating the number of distinct items in a stream of data. Together, they permit memory-efficient estimation of the number of distinct n-grams and motivate finer theoretical investigations into efficient n-gram hashing.

The number of distinct n-grams grows large with n: storing Shakespeare's First Folio [Project Gutenberg Literary Archive Foundation 2007] takes up about 4.6 MiB, but we can verify that it has over 3 million distinct 15-grams of characters. If each distinct n-gram can be stored using log(4.6 × 10^6) ≈ 22 bits, then we need over 8 MiB just to store the n-grams (3 × 10^6 × 22/8 ≈ 8.3 × 10^6), without counting the indexing overhead. Indexing data structures such as suffix arrays require even more storage. Thus, storing and indexing n-grams can use up more storage than the original data source. Extrapolating this to the large corpora used in computational linguistic studies, we see the futility of using brute-force approaches that store the n-grams in main memory, when n is large. For smaller values of n, n-grams of English words are also likely to defeat brute-force approaches.
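This arithmetic is easy to replay; in Python (an illustrative fragment, using only the figures quoted above):

import math

# Each distinct n-gram can be encoded by its starting position in the text,
# hence about log2(4.6e6) = 22 bits per distinct n-gram.
print(math.log2(4.6e6))    # ~22.1
# Three million distinct 15-grams at 22 bits each:
print(3e6 * 22 / 8)        # ~8.3e6 bytes, before any indexing overhead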
There are two strategies to estimate statistics of a sequence in a single pass [Kearns et al. 1994; Batu et al. 2002; Guha et al. 2006]. The generative (or black-box) strategy samples values at random. From the samples, the probability of each value is estimated by maximum likelihood or other statistical techniques. The evaluative strategy, on the other hand, probes the exact probabilities or, equivalently, the number of occurrences of (possibly randomly) chosen values. In one pass, we can randomly probe several n-grams so that we know their exact frequency. For example, given the sequence "aadababbabaacc", a random sampling might be "aabc"; from this, P(a), P(b), and P(c) can be approximated. However, one could instead compute the exact frequency of the value "b."

On the one hand, it is difficult to estimate the number of distinct elements from a sampling, without making further assumptions. For example, suppose there is only one distinct n-gram in 100 samples out of 100,000 n-grams. Should we conclude that there is only one distinct n-gram overall? Perhaps there are 100 distinct n-grams, but 99 of them only occur once — thus there is a ≈ 91% probability that we observe only the common one. While this example is a bit extreme, skewed distributions are quite common, as Zipf's law shows. Choosing, a priori, the number of samples we require is a major difficulty. Estimating the probabilities from sampling is a problem that still interests researchers to this
day [McAllester and Schapire 2000; Orlitsky et al. 2003]. On the other hand, distinct-count estimates from a probing are statistically easier [Gibbons and Tirthapura 2001]. With the example above, with just enough storage budget to store 100 distinct n-grams, we would get an exact count estimate! On the downside, probing requires properly randomized hashing.

In the spirit of probing, Gibbons-Tirthapura (GT) [Gibbons and Tirthapura 2001] count estimation goes as follows. We have m distinct items in a stream containing the distinct items x_1, ..., x_m with possible repetitions. Let h(x_i) be pairwise-independent hash values over [0, 2^L) and let h_t(x_i) be the first t bits of the hash value. We have that E(card(h_t^{−1}(0))) = m/2^t. Given a fixed memory budget M, and setting t = 0, we scan the data. While scanning, we store all distinct items x_i such that h_t(x_i) = 0 in a look-up table H. As soon as size(H) = M + 1, we increment t by 1 and remove all x_i in H such that h_t(x_i) ≠ 0. Typically, at least one element in H is removed, but if not, the process of incrementing t and removing items is repeated until size(H) ≤ M. Then we continue scanning. After the run is completed, we return size(H) × 2^t as the estimate. By choosing M = 576/ε² [Bar-Yossef et al. 2002], we achieve an accuracy of ε, 5 times out of 6 (i.e., P(|size(H) × 2^t − m| > εm) < 1/6), by an application of Chebyshev's inequality. By a Chernoff bound, running the algorithm O(log 1/δ) times and taking the median of the results gives a reliability of 1 − δ instead of 5/6. Bar-Yossef et al. suggest improving the algorithm by storing hash values of the x_i's instead of the x_i's themselves, reducing the memory usage at some cost in reliability. Notice that our Corollary 2 shows that the estimate of a 5/6 reliability for M = 576/ε² is pessimistic: M = 576/ε² implies a reliability of over 99%. We also prove that replacing pairwise independent hashing by 4-wise independent hashing substantially improves the existing theoretical performance bounds.

Random hashing can be the real bottleneck in probing, but to alleviate this problem for n-gram hashing, we use recursive hashing [Cohen 1997; Karp and Rabin 1987]: we leverage the fact that successive n-grams have n − 1 characters in common. We study empirically online n-gram count estimation that uses one pass and hashes each n-gram only once. We compare several different recursive n-gram hashing algorithms, including hashing by cyclic and irreducible polynomials in GF(2)[x]. We characterize the independence of recursive hashing by polynomials, whereas only empirical studies had existed. We prove that hashing by cyclic polynomials is not even uniform, whereas hashing by irreducible polynomials is pairwise independent. It is interesting to contrast these new theoretical results with the fact that Cohen reported [Cohen 1997] that hashing by cyclic polynomials provided more uniform hashing, experimentally. Fortunately, we were also able to prove that hashing by cyclic polynomials is quasi-pairwise independent.

The main contributions of this paper are an analytical survey of recursive hashing, a tighter theoretical bound in count estimation when hashing functions are more than pairwise independent, and an experimental validation to demonstrate practical usefulness and to suggest further theoretical investigation.
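The GT algorithm is short enough to sketch in full. The following Python fragment is ours, for illustration: it reads "the first t bits" as the t least-significant bits, and it draws fully independent hash values from a lazily filled random table, a convenient stand-in for (not a member of) the hash families studied in Section 5.

import random

def gt_estimate(stream, M, L=32, seed=None):
    # Gibbons-Tirthapura distinct-count estimation: one pass, one hash per item.
    rnd = random.Random(seed)
    table = {}  # lazily assigned L-bit hash values (toy, fully independent)

    def h(x):
        if x not in table:
            table[x] = rnd.getrandbits(L)
        return table[x]

    t = 0
    H = set()
    for x in stream:
        if h(x) % (1 << t) == 0:   # h_t(x) = 0: keep the item
            H.add(x)
            while len(H) > M:      # budget exceeded: refine the filter
                t += 1
                H = {y for y in H if h(y) % (1 << t) == 0}
    return len(H) * (1 << t)

For instance, gt_estimate((i % 1000 for i in range(10**5)), M=64) typically returns a value close to the true count of 1,000.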
2. ORGANIZATION AND CONTRIBUTION
In Section 4, we present multidimensional hashing and, specifically, the problem of hashing n-grams. We review the concepts of uniform hashing as well as k-wise independence. We present recursive hashing as a fast way to hash all n-grams in a string by repeatedly updating the hash of the previous n-gram. We then prove that recursive hashing, according to our definition,
cannot be pairwise independent (see Proposition 1). Because we wish for pairwise independence, we introduce recursive hashing over hash values as a weaker form of recursivity. We then prove that pairwise independence is possible, but no more (see Proposition 2).

In Section 5, we present four hashing techniques for n-grams. We begin by showing how we can compute non-recursive n-wise independent hash values, as long as the number of distinct characters or words fits in memory. In Subsection 5.1, we present two hash families introduced by Cohen [Cohen 1997], CYCLIC and GENERAL. Both involve very similar operations in GF(2)[x]. We prove that GENERAL is pairwise independent. Regarding CYCLIC, we show that it is not even uniform. In Subsection 5.2, we take a closer look at CYCLIC and show that almost all of its bits are pairwise independent. Finally, in Subsection 5.3, we present a simple recursive, but non-uniform, hash technique called ID37, based on the classical Karp-Rabin hash family.

In Section 6, we review the GT count-estimation algorithm. Since most of the computational cost of the GT algorithm is in the hashing, we argue that it might be inefficient in practice to repeatedly hash n-grams to apply a Chernoff bound. In Proposition 5, we offer a novel result, showing that better bounds on the GT count-estimation algorithm are possible if we use k-wise independent hash values (k > 2) instead of merely pairwise ones.

Finally, in Section 7, we present our experimental results of the GT count-estimation algorithm over the four hash families (n-wise independent, CYCLIC, GENERAL, and ID37). Using both synthetic Zipfian data and text from the Gutenberg Project, we show that hashing each n-gram only once is sufficient to achieve good accuracy. The experimental results deliver some surprises. For example, CYCLIC and GENERAL have very similar performance despite the apparent theoretical edge of GENERAL, which is truly pairwise independent. We can partly explain this result since the GT count-estimation algorithm tends to mostly use the least significant bits of each hash value. A more surprising result is that CYCLIC and GENERAL sometimes outperform the slower n-wise independent hashing method we used.
3. RELATED WORK
Related work includes reservoir sampling, suffix arrays, and view-size estimation in OLAP.

We can choose randomly, without replacement, k samples in a sequence of unknown length, using a single pass through the data, by reservoir sampling. Reservoir sampling [Vitter 1985; Kolonko and Wasch 2006; Li 1994] was first introduced by Knuth [Knuth 1969]. All reservoir-sampling algorithms begin by appending the first k samples to an array. In their linear-time (O(N)) form, reservoir-sampling algorithms sequentially visit every symbol, choosing it as a possible sample with probability k/t, where t is the number of symbols read so far. The chosen sample is simply appended at the end of the array while an existing sample is flagged as having been removed. The array has an average size of k(1 + log N/k) samples at the end of the run. In their sublinear form (O(k(1 + log(N/k))) expected time), the algorithms skip a random number of data points each time. While these algorithms use a single pass, they assume that the number of required samples k is known a priori, but choosing k is difficult without any knowledge of the data distribution.
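In its linear-time form, reservoir sampling is only a few lines. The sketch below is ours; it keeps a fixed-size array and overwrites a random victim rather than using the append-and-flag bookkeeping described above, which does not change the distribution of the output.

import random

def reservoir_sample(stream, k, seed=None):
    # One-pass uniform sample, without replacement, of k items from a
    # stream of unknown length: item t is kept with probability k/t.
    rnd = random.Random(seed)
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)                 # first k items always kept
        elif rnd.random() < k / t:
            reservoir[rnd.randrange(k)] = x     # replace a random victim
    return reservoir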
Using suffix arrays [Manber and Myers 1990; 1993] and the length of the maximal common prefix between successive suffixes, Nagao and Mori [Nagao and Mori 1994] proposed a fast algorithm to compute n-gram statistics exactly. However, it cannot be considered an online algorithm even if we compute the suffix array in one pass: after constructing the suffix array, one must go through all suffixes at least once more. Their implementation was later improved by Kit and Wilks [Kit and Wilks 1998]. Compared to uncompressed suffix trees [Giegerich et al. 1999], uncompressed suffix arrays use several times less storage and their performance does not depend on the size of the alphabet. Suffix arrays can be constructed in O(N) time using O(N) working space [Hon et al. 2003]. Querying a suffix array for a given n-gram takes O(log N) time.

By definition, each n-gram is a tuple of length n and can be viewed as a relation to be aggregated. OLAP (On-Line Analytical Processing) [Codd 1993] is a database acceleration technique used for deductive analysis, typically involving aggregation. To achieve acceleration, one frequently builds data cubes [Gray et al. 1996] where multidimensional relations are preaggregated in multidimensional arrays. OLAP is commonly used for business purposes with dimensions such as time, location, sales, expenses, and so on. Concerning text, most work has focused on informetrics/bibliomining, document management, and information retrieval [McCabe et al. 2000; Mothe et al. 2003; Niemi et al. 2003; Bernard 1995; Sullivan 2001]. The idea of using OLAP for exploring the text content itself (including phrases and n-grams) was proposed for the first time by Keith, Kaser and Lemire [Keith et al. 2005b; 2005a; Kaser et al. 2006]. The estimation of n-gram counts can be viewed as an OLAP view-size estimation problem, which itself "remains an important area of open research" [Dehne et al. 2006].

A data-agnostic approach to view-size estimation [Shukla et al. 1996], which is likely to be used by database vendors, can be computed almost instantly. For n-gram estimation, the number of attributes is the size of the alphabet |Σ| and η is the number of n-grams with possible repetitions (η = N − n + 1). Given η cells picked uniformly at random, with replacement, in a V = K_1 × K_2 × ... × K_n space, the probability that any given cell (think "n-gram") is omitted is (1 − 1/V)^η. For n-grams, V = |Σ|^n. Therefore, the expected number of unoccupied cells is (1 − 1/V)^η × V. Similarly, assuming the number of n-grams is known to be m, the same model permits us to estimate the number of (n−1)-grams by m × (1 − (m/V)^{|Σ|}). In practice, this approach systematically overestimates because relations are not uniformly distributed.

A more sophisticated view-size estimation algorithm used in the context of data warehousing and OLAP [Shukla et al. 1996; Kotidis 2002] is logarithmic probabilistic counting [Flajolet and Martin 1985]. This approach requires a single pass and almost no memory, but it assumes independent hashing, for which no algorithm using limited storage is known [Bar-Yossef et al. 2002]. Alon et al. [Alon et al. 1996, Proposition 2.3] showed probabilistic counting could work with only pairwise independence, but the error bounds are large (at least 100%): for any c > 2, the relative error is bounded by c − 1 with reliability 1 − 2/c. Practical results are sometimes disappointing [Dehne et al. 2006]. Other variants of this approach include linear probabilistic counting [Whang et al. 1990; Shah et al. 2004] and Loglog counting [Durand and Flajolet 2003], but their theoretical analysis also assumes independent hashing.
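The data-agnostic estimate is immediate to compute. The Python helper below is ours; it returns the expected number of occupied cells, V − V(1 − 1/V)^η, using log1p/expm1 so that the formula survives the astronomically large V = |Σ|^n.

import math

def expected_distinct(eta, V):
    # Expected occupied cells after eta uniform draws, with replacement,
    # from V cells: V(1 - (1 - 1/V)^eta).
    return -V * math.expm1(eta * math.log1p(-1.0 / V))

# Example: 15-grams of characters, |Sigma| = 26, in a 10^6-symbol text.
eta = 10**6 - 15 + 1
V = 26.0**15
print(expected_distinct(eta, V))   # essentially eta: the space is very sparse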
View-size estimation through sampling has been made adaptive by Haas et al. [Haas et al. 1995]: their strategy is to first attempt to determine whether the distribution is skewed and then use an appropriate statistical estimator. We can also count the marginal frequencies of each attribute value (or symbol, in an n-gram setting) and use them to give estimates as well as (exact) lower and upper bounds on the view size [Yu et al. 2005]. Other researchers
make particular assumptions on the distribution of relations [Nadeau and Teorey 2003; Ciaccia et al. 2001; 2003; Faloutsos et al. 1996].
4. MULTIDIMENSIONAL RANDOM HASHING
Hashing encodes an object as a fixed-length bit string for comparison. Multidimensional hashing is a particular form of hashing where the objects can be represented as tuples. Multidimensional hashing is of general interest since several commonly occurring objects can be thought of as tuples: 32-bit values can be seen as 8-tuples containing 4-bit values. For convenience, we consider hash functions mapping keys to [0, 2^L), where the set U of possible keys is much larger than 2^L.

A difficulty with hashing is that any particular hash function h has some "bad inputs" S ⊂ U over which some hash value (such as 0) is either too frequent or not frequent enough (card(h^{−1}(0)) ≉ card(S)/2^L), making count estimates from hash values difficult. Rather than make assumptions about the probabilities of bad inputs for a particular fixed hash function, an alternative approach [Carter and Wegman 1979] selects the hash function randomly from some family H of functions, all of which map U to [0, 2^L). Clearly, some families H have desirable properties that other families do not have. For instance, consider a family whose members always map to even numbers — then, picking h randomly in H, we have P(h(x) = i) = 0 for any odd i and any x ∈ U. This would be an undesirable property for many applications. We now mention some desirable properties of families.

H is uniform if, considering h selected uniformly at random from H and for all x and y, we have P(h(x) = y) = 1/2^L. This condition is weak; the family of constant functions (h(x) = c) is uniform but would be disastrous when used with the GT algorithm. We need stronger conditions implying that any particular member h of the family must hash objects evenly over [0, 2^L).

H is pairwise independent or universal [Carter and Wegman 1979] if we have that P(h(x_1) = y ∧ h(x_2) = z) = P(h(x_1) = y) P(h(x_2) = z) = 1/4^L for all x_1, x_2, y, z with x_1 ≠ x_2. We will refer to such an h ∈ H as a "pairwise independent hash function" when the family in question can be inferred from the context or is not important. Pairwise independence implies uniformity.

As previously stated, we are only interested in hash values in [0, 2^L), but let us consider a more general case as an example. A well-known example of a pairwise-independent hash function for keys in the range [0, B^{r+1}), where B is prime, is computed as follows. Express the key x as x_r x_{r−1} ... x_0 in base B. Randomly choose a number a ∈ [0, B^{r+1}) and express it as a_r a_{r−1} ... a_0 in base B. Then, set h(x) = ∑_{i=0}^{r} a_i x_i mod B. The proof that it is pairwise independent follows from the fact that the integers modulo a prime number form a field (GF(B)).

Gibbons and Tirthapura showed that pairwise independence was sufficient to approximate count statistics accurately [Gibbons and Tirthapura 2001], essentially because the variance of a sum of pairwise independent variables is just the sum of the variances (Var(X_1 + ... + X_j) = Var(X_1) + ... + Var(X_j)). Moreover, the idea of pairwise independence can be generalized: a family of hash functions H is k-wise independent if, given distinct x_1, ..., x_k and h selected uniformly at random from H, we have P(h(x_1) = y_1 ∧ ... ∧ h(x_k) = y_k) = 1/2^{kL}. Note that k-wise independence implies (k−1)-wise independence and uniformity. (Fully) independent hash functions are k-wise independent for arbitrarily large k. This paper contributes better bounds for approximate count statistics, provided that hash functions with more independence are used (4-wise instead of pairwise, for instance).
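As a concrete sketch (ours) of this classical family in Python:

import random

def make_pairwise_hash(B, r, seed=None):
    # One member of the family h(x) = sum_i a_i x_i mod B, where
    # x_r ... x_0 are the base-B digits of a key x in [0, B^(r+1)) and
    # the a_i are drawn uniformly at random from [0, B); B must be prime.
    rnd = random.Random(seed)
    a = [rnd.randrange(B) for _ in range(r + 1)]

    def h(x):
        total = 0
        for a_i in a:                 # walk the base-B digits of x
            total = (total + a_i * (x % B)) % B
            x //= B
        return total

    return h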
In the context of n-gram hashing, we seek recursive families of hash functions so that we can compute new hash values quickly by reusing previous hash values. A hash function h is recursive if there exists a (fixed) function F over triples such that

h(x_2, x_3, ..., x_{n+1}) = F(h(x_1, x_2, ..., x_n), x_1, x_{n+1})   for all x_1, x_2, ..., x_{n+1}.
The function F must be independent of h (F is common to all members of the family). For example, if the values x_i are integers, h(x_1, x_2, ..., x_n) = ∑_{i=1}^{n} x_i mod 2^L is a recursive hash function since we can choose F(a, b, c) = a − b + c mod 2^L. As an example of a common recursive hash function, given tuples (x_1, x_2, ..., x_n) whose components are integers taken from [0, B), we can hash by the Karp-Rabin formula ∑_i x_i B^{i−1} mod R, where R is some prime defining the range of the hash function [Karp and Rabin 1987; Gonnet and Baeza-Yates 1990]. It is a poor hash function for us, since n-grams with common suffixes all get very similar hash values. For probabilistic-counting approaches based on the number of trailing zeros of the hashed value, if H = h(x_1, x_2, ..., x_{n−1}, x_n) has many trailing zeros, then we know that h(x′_1, x_2, ..., x_{n−1}, x_n) will hash to a value close to H, which therefore cannot also have many trailing zeros (assuming x_1 ≠ x′_1). In fact, no recursive hash function can be pairwise independent, even if we focus only on trailing zeros.

Let zeros(x) return the number of trailing zeros (0, 1, ..., L) of x, where zeros(0) = L. We say h is trailing-zero uniform if P(zeros(h(x)) ≥ j) = 2^{−j}, for j = 0, 1, ..., L and all n-grams x. Clearly, uniformity implies trailing-zero uniformity. Similarly, we say h is k-wise trailing-zero independent if P(zeros(h(x_1)) ≥ j_1 ∧ zeros(h(x_2)) ≥ j_2 ∧ ... ∧ zeros(h(x_k)) ≥ j_k) = 2^{−j_1−j_2−...−j_k}, for j_i = 0, 1, ..., L. Again, if h is k-wise independent, it is k-wise trailing-zero independent. The converse is not true: if h is a k-wise independent function, consider g ∘ h where g makes zero all bits before the rightmost 1 (e.g., g(0101100) = 0000100). The hash g ∘ h is k-wise trailing-zero independent but not even uniform (consider that P(g = 0001) = 8 P(g = 1000)). The GT algorithm will behave identically whether using g ∘ h or h, showing that uniformity is not a necessary condition for good performance.
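Both zeros and g are one-liners on machine integers. In the sketch below (ours), g exploits the two's-complement identity x & -x, which isolates the rightmost 1 bit:

def zeros(x, L):
    # Number of trailing zeros of an L-bit value; zeros(0) = L by convention.
    count = 0
    while count < L and (x >> count) & 1 == 0:
        count += 1
    return count

def g(x):
    # Zero all bits before the rightmost 1, e.g. g(0b0101100) = 0b0000100.
    return x & -x

assert g(0b0101100) == 0b0000100
assert zeros(0b0101100, 7) == 2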
PROPOSITION 1. Recursive hash functions are not pairwise trailing-zero independent.

PROOF. Suppose h is recursive and pairwise trailing-zero independent. Let h_{i,j} = h(x_i, ..., x_j). Then F(h_{1,n}, x_1, x_{n+1}) = h_{2,n+1}, where the function F must be independent of h. Fix the values x_1, ..., x_{n+1}; then

P(zeros(h_{1,n}) ≥ L ∧ zeros(h_{2,n+1}) ≥ L) = P(h_{1,n} = 0 ∧ F(0, x_1, x_{n+1}) = 0) = P(h_{1,n} = 0) if F(0, x_1, x_{n+1}) = 0, and 0 otherwise.

Either way, the probability is 2^{−L} or 0, never the 2^{−2L} required by pairwise trailing-zero independence: a contradiction.

Immediately, we have that recursive hash functions are not pairwise independent. Hence, we are motivated to seek a weaker form of recursivity. By extension (this sense resembles Cohen's use of "recursive" [Cohen 1997]), a hash function h
is recursive over hash values τ(x_i), where τ is a randomized hash function for symbols, if there is a function F, independent of τ and h, such that h(x_2, ..., x_{n+1}) = F(h(x_1, ..., x_n), τ(x_1), τ(x_{n+1})). A trivial example of a hash function recursive over hash values is h(x_1, x_2, ..., x_n) = ∑_{i=1}^{n} τ(x_i) mod 2^L. (The multidimensional hash function h depends implicitly on the hash function τ.) As the next proposition shows, being recursive over hashed values, while a weaker requirement, does not allow more than pairwise independence.

PROPOSITION 2. There is no 3-wise trailing-zero independent hash function that is recursive over hashed values.

PROOF. Consider the string of symbols a^n bb, i.e., n occurrences of a followed by bb. Suppose h is 3-wise trailing-zero independent; then

P(zeros(h(a, ..., a)) ≥ L ∧ zeros(h(a, ..., a, b)) ≥ L ∧ zeros(h(a, ..., a, b, b)) ≥ L)
= P(h(a, ..., a) = 0 ∧ F(0, τ(a), τ(b)) = 0 ∧ F(0, τ(a), τ(b)) = 0)
= P(h(a, ..., a) = 0 ∧ F(0, τ(a), τ(b)) = 0)
= P(zeros(h(a, ..., a)) ≥ L ∧ zeros(h(a, ..., a, b)) ≥ L)
= 2^{−2L} by trailing-zero pairwise independence
≠ 2^{−3L} as required by trailing-zero 3-wise independence.

Hence, we have a contradiction and no such h can exist. This also implies that no 3-wise independent hash can be recursive over hashed values.
5. HASHING n-GRAMS
In this section, we begin by showing how we can construct a moderately efficient (non-recursive) n-wise independent hash function. We then define three recursive hashing families, CYCLIC, GENERAL, and ID37, and study their properties, being especially attentive to CYCLIC. All four hashing families will later be benchmarked in the context of count estimation using the GT algorithm.

A trivial way to generate an independent hash is to assign a random integer in [0, 2^L) to each new value x. Unfortunately, this requires as much processing and storage as a complete indexing of all values. However, in a multidimensional setting this approach can be put to good use. Suppose that we have tuples in K_1 × K_2 × ... × K_n such that |K_i| is small for all i. We can construct independent hash functions h_i : K_i → [0, 2^L) for all i and combine them. The hash function h(x_1, x_2, ..., x_n) = h_1(x_1) ⊕ h_2(x_2) ⊕ ... ⊕ h_n(x_n) is then n-wise independent (⊕ is the exclusive or). As long as the sets K_i are small, in time O(∑_i |K_i|) we can construct the hash function by generating ∑_i |K_i| random numbers and storing them in a look-up table (see Algorithm 1). With constant-time look-up, hashing an n-gram thus takes O(Ln) time, or O(n) if L is considered a constant. Unfortunately, this hash function is not recursive.

In the n-gram context, we can choose h_1 = h_2 = ... since Σ = K_1 = K_2 = .... While the resulting hash function is recursive over hashed values, since h(x_2, ..., x_{n+1}) = h_1(x_2) ⊕ ... ⊕ h_1(x_{n+1}) = h_1(x_1) ⊕ h_1(x_{n+1}) ⊕ h(x_1, ..., x_n),
it is no longer even pairwise independent, since P(h(a, b) = w ∧ h(b, a) = v) = 0 if w ≠ v.

Algorithm 1 The (non-recursive) n-wise independent family.
Require: n L-bit hash functions h_1, h_2, ..., h_n over Σ from an independent hash family
1: s ← empty FIFO structure
2: for each character c do
3:   append c to s
4:   if length(s) = n then
5:     yield h_1(s_1) ⊕ h_2(s_2) ⊕ ... ⊕ h_n(s_n) {the yield statement returns the value, without terminating the algorithm}
6:     remove the oldest character from s
7:   end if
8: end for
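In Python, Algorithm 1 becomes the following generator (our sketch); the per-position hash functions h_1, ..., h_n are realized as lazily filled tables of random L-bit values:

import random

def ngram_hashes_nwise(stream, n, L=19, seed=None):
    # n-wise independent n-gram hashing: XOR of n independent per-position
    # hash functions, one table entry per (position, symbol) pair.
    rnd = random.Random(seed)
    tables = [dict() for _ in range(n)]

    def h(i, c):  # h_i(c), assigned on first use
        if c not in tables[i]:
            tables[i][c] = rnd.getrandbits(L)
        return tables[i][c]

    s = []
    for c in stream:
        s.append(c)
        if len(s) == n:
            value = 0
            for i, symbol in enumerate(s):
                value ^= h(i, symbol)
            yield value
            s.pop(0)   # remove the oldest character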
5.1 Recursive Hashing by Polynomials
We seek families of hash functions that behave, for practical purposes, as if n-wise independent, while being recursive over hash values. A particularly interesting form of hashing using the binary Galois field GF(2) is called "Recursive Hashing by Polynomials" and has been attributed to Kubina by Cohen [Cohen 1997]. GF(2) contains only two values (1 and 0), with addition (and hence subtraction) defined by exclusive or, a + b = a ⊕ b, and multiplication defined by "and", a × b = a ∧ b. GF(2)[x] is the vector space of all polynomials with coefficients from GF(2). Any integer in binary form (e.g., c = 1101) can thus be interpreted as an element of GF(2)[x] (e.g., c = x^3 + x^2 + 1). If p(x) ∈ GF(2)[x], then GF(2)[x]/p(x) can be thought of as GF(2)[x] modulo p(x). As an example, if p(x) = x^2, then GF(2)[x]/p(x) is the set of all linear polynomials. For instance, x^3 + x^2 + x + 1 = x + 1 mod x^2 since, in GF(2)[x], (x + 1) + x^2(x + 1) = x^3 + x^2 + x + 1.

Interpreting h_1 hash values as polynomials in GF(2)[x]/p(x), and with the condition that degree(p(x)) ≥ n, we define

h(a_1, a_2, ..., a_n) = h_1(a_1) x^{n−1} + h_1(a_2) x^{n−2} + ... + h_1(a_n).

It is recursive over the sequence h_1(a_i). The combined hash can be computed in constant time with respect to n by reusing previous hash values:

h(a_2, a_3, ..., a_{n+1}) = x h(a_1, a_2, ..., a_n) − h_1(a_1) x^n + h_1(a_{n+1}).

Choosing p(x) = x^L + 1 for L ≥ n, for any polynomial q(x) = ∑_{i=0}^{L−1} q_i x^i, we have
x q(x) = x (q_{L−1} x^{L−1} + ... + q_1 x + q_0) = q_{L−2} x^{L−1} + ... + q_0 x + q_{L−1}.

Thus, multiplication by x is a bitwise rotation, a cyclic left shift. The resulting hash (see Algorithm 2) is called RECURSIVE HASHING BY CYCLIC POLYNOMIALS [Cohen 1997], or (for short) CYCLIC. It was shown empirically to be uniform [Cohen 1997], but it is not formally so:

LEMMA 1. CYCLIC is not uniform for n even and never pairwise independent.

PROOF. If n is even, use the fact that x^{n−1} + ... + x + 1 is divisible by x + 1 to write x^{n−1} + ... + x + 1 = (x + 1) r(x) for some polynomial r(x). Clearly, r(x)(x + 1)(x^{L−1} + x^{L−2} + ... + x + 1) = 0 mod x^L + 1 for any r(x), and so P(h(a_1, a_1, ..., a_1) = 0) = P((x^{n−1} + ... + x + 1) h_1(a_1) = 0) = P((x + 1) r(x) h_1(a_1) = 0) ≥ P(h_1(a_1) = 0 ∨ h_1(a_1) = x^{L−1} + x^{L−2} + ... + x + 1) = 1/2^{L−1}. Therefore, CYCLIC is not uniform for n even.
Algorithm 2 The recursive CYCLIC family.
Require: an L-bit hash function h_1 over Σ from an independent hash family
1: s ← empty FIFO structure
2: x ← 0 (L-bit integer)
3: z ← 0 (L-bit integer)
4: for each character c do
5:   append c to s
6:   rotate x left by 1 bit
7:   rotate z left by n bits
8:   x ← x ⊕ z ⊕ h_1(c)
9:   if length(s) = n then
10:    yield x
11:    remove the oldest character y from s
12:    z ← h_1(y)
13:  end if
14: end for

To show CYCLIC is never pairwise independent, consider n = 3 (for simplicity); then P(h(a_1, a_1, a_2) = h(a_1, a_2, a_1)) = P((x + 1)(h_1(a_1) + h_1(a_2)) = 0) ≥ P(h_1(a_1) + h_1(a_2) = 0 ∨ h_1(a_1) + h_1(a_2) = x^{L−1} + x^{L−2} + ... + x + 1) = 1/2^{L−1}, but pairwise independent hash values are equal with probability 1/2^L. The result is shown.

Similarly, to generate hash functions over [0, 2^L), we can choose p(x) to be an irreducible polynomial of degree L in GF(2)[x]: an irreducible polynomial cannot be factored into non-trivial polynomials. For L = 19, an example is p(x) = x^19 + x^18 + x^17 + x^16 + x^12 + x^7 + x^6 + x^5 + x^3 + x^2 + 1 [Ruskey 2006]. (With this particular irreducible polynomial, L = 19 and so we require n ≤ 19. Irreducible polynomials of larger degree can be found [Ruskey 2006] if desired.) Computing (a_18 x^18 + ... + a_0) x mod p(x) as a polynomial of degree 18 or less, for representation in 19 bits, can be done efficiently. We have (a_18 x^18 + ... + a_0) x = a_18 (p(x) − x^19) + a_17 x^18 + ... + a_0 x mod p(x), and the polynomial on the right-hand side is of degree at most 18. In practice, we do a left shift of the hash value and, if the value of the 20th bit is 1, we apply an exclusive or with the integer 1 + 2^2 + 2^3 + 2^5 + 2^6 + 2^7 + 2^12 + 2^16 + 2^17 + 2^18 + 2^19. The resulting hash is called RECURSIVE HASHING BY GENERAL POLYNOMIALS [Cohen 1997], or (for short) GENERAL (see Algorithm 3).

To summarize, both CYCLIC and GENERAL hash n-grams with the formula h(a_1, a_2, ..., a_n) = x^{n−1} h_1(a_1) + ... + x h_1(a_{n−1}) + h_1(a_n), but CYCLIC operates in GF(2)[x]/(x^L + 1) whereas GENERAL does the computation in GF(2)[x]/p(x) for some irreducible polynomial p(x). The main benefit of setting p(x) to be an irreducible polynomial is that GF(2)[x]/p(x) is a field; in particular, it is no longer possible that p_1(x) p_2(x) = 0 mod p(x) unless either p_1(x) = 0 or p_2(x) = 0. The field property allows us to prove that the hash function is pairwise independent (see Lemma 2), but it is not 3-wise independent because of Proposition 2. In the sense of Proposition 2, it is an optimal recursive hashing function.
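Concretely, Algorithm 2 can be rendered as the following Python generator (our sketch), where rotation implements multiplication by x in GF(2)[x]/(x^L + 1) and h1 maps symbols to random L-bit integers:

def ngram_hashes_cyclic(stream, n, h1, L=19):
    # CYCLIC: rotate the running hash by one bit, rotate the outgoing
    # symbol's hash by n bits, and XOR in the incoming symbol's hash.
    mask = (1 << L) - 1

    def rotl(v, r):   # cyclic left shift by r bits within L bits
        r %= L
        return ((v << r) | (v >> (L - r))) & mask

    s, x, z = [], 0, 0
    for c in stream:
        s.append(c)
        x = rotl(x, 1) ^ rotl(z, n) ^ h1(c)
        if len(s) == n:
            yield x
            z = h1(s.pop(0))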
Algorithm 3 The recursive GENERAL family.
Require: an L-bit hash function h_1 over Σ from an independent hash family; an irreducible polynomial p of degree L in GF(2)[x]
1: s ← empty FIFO structure
2: x ← 0 (L-bit integer)
3: z ← 0 (L-bit integer)
4: for each character c do
5:   append c to s
6:   x ← shift(x)
7:   z ← shift^n(z) {apply shift n times}
8:   x ← x ⊕ z ⊕ h_1(c)
9:   if length(s) = n then
10:    yield x
11:    remove the oldest character y from s
12:    z ← h_1(y)
13:  end if
14: end for

1: function shift
2: input: L-bit integer x
3: shift x left by 1 bit, storing the result in an (L+1)-bit integer x′
4: if the leftmost bit of x′ is 1 then
5:   x′ ← x′ ⊕ p
6: end if {the leftmost bit of x′ is thus always 0}
7: return the rightmost L bits of x′

LEMMA 2. GENERAL is pairwise independent.

PROOF. If p(x) is irreducible, then any non-zero q(x) ∈ GF(2)[x]/p(x) has an inverse, noted q^{−1}(x), since GF(2)[x]/p(x) is a field. Interpret hash values as polynomials in GF(2)[x]/p(x).

Firstly, we prove that GENERAL is uniform. In fact, we show a stronger result: P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y) = 1/2^L for any polynomials q_i where at least one is different from zero. The result is shown by induction on the number of non-zero polynomials: it is clearly true where there is a single non-zero polynomial q_i(x), since q_i(x) h_1(a_i) = y ⟺ h_1(a_i) = q_i^{−1}(x) y. Suppose it is true up to k − 1 non-zero polynomials and consider a case where we have k non-zero polynomials. Assuming without loss of generality that q_1(x) ≠ 0, we have P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y) = P(h_1(a_1) = q_1^{−1}(x)(y − q_2(x) h_1(a_2) − ... − q_n(x) h_1(a_n))) = ∑_{y′} P(h_1(a_1) = q_1^{−1}(x)(y − y′)) P(q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y′) = ∑_{y′} (1/2^L)(1/2^L) = 1/2^L by the induction argument. Hence the uniformity result is shown.

Consider two distinct sequences a_1, a_2, ..., a_n and a′_1, a′_2, ..., a′_n. Write H_a = h(a_1, a_2, ..., a_n) and H_{a′} = h(a′_1, a′_2, ..., a′_n). We have that P(H_a = y ∧ H_{a′} = y′) = P(H_a = y | H_{a′} = y′) P(H_{a′} = y′). Hence, to prove pairwise independence, it suffices to show that P(H_a = y | H_{a′} = y′) = 1/2^L.

Suppose that a_i = a′_j for some i, j; if not, the result follows since, by the (full) independence of the hash function h_1, the values H_a and H_{a′} are independent. Write q(x) = −(∑_{k | a_k = a_i} x^{n−k})(∑_{k | a′_k = a′_j} x^{n−k})^{−1}; then H_a + q(x) H_{a′} is independent of a_i = a′_j (and of h_1(a_i) = h_1(a′_j)). In H_a + q(x) H_{a′}, only hashed values h_1(a_k) for a_k ≠ a_i and h_1(a′_k) for a′_k ≠ a′_j remain: label them h_1(b_1), ..., h_1(b_m). The result of the substitution can be written
H_a + q(x) H_{a′} = ∑_k q_k(x) h_1(b_k), where the q_k(x) are polynomials in GF(2)[x]/p(x). All q_k(x) are zero if and only if H_a + q(x) H_{a′} = 0 for all values of h_1(a_1), ..., h_1(a_n) and h_1(a′_1), ..., h_1(a′_n) (notice that the value h_1(a_i) = h_1(a′_j) is irrelevant); in particular, it must be true when h_1(a_k) = 1 and h_1(a′_k) = 1 for all k, hence (x^{n−1} + ... + x + 1) + q(x)(x^{n−1} + ... + x + 1) = 0 ⇒ q(x) = −1. Thus, all q_k(x) are zero if and only if H_a = H_{a′} for all values of h_1(a_1), ..., h_1(a_n) and h_1(a′_1), ..., h_1(a′_n), which only happens if the sequences a and a′ are identical. Hence, not all q_k(x) are zero.

Write H_{y′,a′} = (∑_{k | a′_k = a′_j} x^{n−k})^{−1} (y′ − ∑_{k | a′_k ≠ a′_j} x^{n−k} h_1(a′_k)). On the one hand, the condition H_{a′} = y′ can be rewritten as h_1(a′_j) = H_{y′,a′}. On the other hand, H_a + q(x) H_{a′} = y + q(x) y′ is independent of h_1(a′_j) = h_1(a_i). Because P(h_1(a′_j) = H_{y′,a′}) = 1/2^L irrespective of y′ and of the h_1(a′_k) for k ∈ {k | a′_k ≠ a′_j}, we have P(h_1(a′_j) = H_{y′,a′} | H_a + q(x) H_{a′} = y + q(x) y′) = P(h_1(a′_j) = H_{y′,a′}), which implies that h_1(a′_j) = H_{y′,a′} and H_a + q(x) H_{a′} = y + q(x) y′ are independent. Hence, we have

P(H_a = y | H_{a′} = y′) = P(H_a + q(x) H_{a′} = y + q(x) y′ | h_1(a′_j) = H_{y′,a′}) = P(H_a + q(x) H_{a′} = y + q(x) y′) = P(∑_k q_k(x) h_1(b_k) = y + q(x) y′),
and by the earlier uniformity result, this last probability is equal to 1/2^L. This concludes the proof.

Of the four recursive hashing functions investigated by Cohen [Cohen 1997], GENERAL and CYCLIC were superior both in terms of speed and uniformity, though CYCLIC had a small edge over GENERAL. For n large, the benefits of these recursive hash functions compared to the n-wise independent hash function presented earlier can be substantial: n table look-ups (recall that we assume Σ is not known in advance; otherwise, for many applications, each table look-up could be merely an array access) are much more expensive than a single look-up followed by binary shifts.
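GENERAL differs from the CYCLIC sketch only in its shift: instead of rotating, multiplication by x reduces modulo p(x). A Python sketch (ours), hard-coding the degree-19 irreducible polynomial quoted above:

P19 = (1 << 19) | (1 << 18) | (1 << 17) | (1 << 16) | (1 << 12) | \
      (1 << 7) | (1 << 6) | (1 << 5) | (1 << 3) | (1 << 2) | 1

def ngram_hashes_general(stream, n, h1, L=19, p=P19):
    # GENERAL: the same rolling scheme as CYCLIC, computed in GF(2)[x]/p(x).
    def shift(v, times=1):
        for _ in range(times):
            v <<= 1
            if v & (1 << L):   # a degree-L term appeared: reduce mod p(x)
                v ^= p
        return v

    s, x, z = [], 0, 0
    for c in stream:
        s.append(c)
        x = shift(x) ^ shift(z, n) ^ h1(c)
        if len(s) == n:
            yield x
            z = h1(s.pop(0))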
5.2 Is CYCLIC almost pairwise independent?
While CYCLIC is not uniform, it was shown empirically to have good uniformity [Cohen 1997]. Hence, it is reasonable to expect CYCLIC to be almost uniform and maybe even almost pairwise independent. To illustrate this intuition, consider Table I, which shows that while h(a, a) is not uniform (h(a, a) = 001 is impossible), h(a, a) minus any bit is indeed uniformly distributed. We will prove that this result holds in general: the next lemma and the next theorem show that CYCLIC is quasi-pairwise independent in the sense that L − n + 1 consecutive bits (e.g., the first or last L − n + 1 bits) are pairwise independent. In other words, CYCLIC is pairwise independent if we are willing to sacrifice n − 1 bits. (We say that n bits are "consecutive modulo L" if the bits are located at indexes i mod L for n consecutive values of i, such as i = k, k + 1, ..., k + n − 1. By "equality modulo ⟨some specified set of bit positions⟩", we mean that the two quantities are bitwise identical, with exceptions permitted only at the specified positions; for our polynomials, "equality modulo the first n bit positions" implies that the difference of the two polynomials has degree at most n − 1.)

LEMMA 3. If q(x) ∈ GF(2)[x]/(x^L + 1) (with q(x) ≠ 0) has degree n < L, then

—the equation q(x)w = y mod x^L + 1, modulo the first n bits, has exactly 2^n solutions for all y;
Table I. CYCLIC hash for various values of h_1(a) (h(a, a) = x h_1(a) + h_1(a) mod x^L + 1)

h_1(a)   h(a,a)   h(a,a), first two bits   h(a,a), last two bits   h(a,a), first and last bit
000      000      00                       00                      00
100      110      11                       10                      10
010      011      01                       11                      01
110      101      10                       01                      11
001      101      10                       01                      11
101      011      01                       11                      01
011      110      11                       10                      10
111      000      00                       00                      00
—more generally, the equation q(x)w = y mod x^L + 1, modulo any n consecutive bits (modulo L), has exactly 2^n solutions for all y.

PROOF. Let P be the set of polynomials of degree at most L − n − 1. Take any p(x) ∈ P; then q(x)p(x) has degree at most L − n − 1 + n = L − 1, and thus if q(x) ≠ 0 and p(x) ≠ 0, then q(x)p(x) ≠ 0 mod x^L + 1. Hence, for any distinct p_1, p_2 ∈ P we have q(x)p_1 ≠ q(x)p_2 mod x^L + 1.

To prove the first item, we begin by showing that there is always exactly one solution in P. Consider that there are 2^{L−n} polynomials p(x) in P, and that all values q(x)p(x) are distinct. Suppose there are p_1, p_2 ∈ P such that q(x)p_1 = q(x)p_2 mod x^L + 1 modulo the first n bits; then q(x)(p_1 − p_2) is a polynomial of degree at most n − 1, while p_1 − p_2 is a polynomial of degree at most L − n − 1 and q(x) is a polynomial of degree n, thus p_1 − p_2 = 0. (If p_1 − p_2 ≠ 0, then degree(q(x)(p_1 − p_2) mod x^L + 1) ≥ degree(q(x)) = n, a contradiction.) Hence, all p(x) in P are mapped to distinct values modulo the first n bits, and since there are 2^{L−n} such distinct values, the result is shown.

Any polynomial of degree L − 1 can be decomposed into the form p(x) + x^{L−n} z(x), where z(x) is a polynomial of degree at most n − 1 and p(x) ∈ P. By the preceding result, for distinct p_1, p_2 ∈ P, q(x)(x^{L−n} z(x) + p_1) and q(x)(x^{L−n} z(x) + p_2) must be distinct modulo the first n bits. In other words, the equation q(x)(x^{L−n} z(x) + p) = y modulo the first n bits has exactly one solution p ∈ P for any z(x), and since there are 2^n polynomials z(x) of degree at most n − 1, q(x)w = y (modulo the first n bits) must have 2^n solutions.

To prove the second item, choose j and use the first item to find any w solving q(x)w = y x^j mod x^L + 1 modulo the first n bits. Then w x^{L−j} is a solution to q(x)w = y mod x^L + 1 modulo the bits in positions j, j + 1, ..., j + n − 1 mod L.

We have the following corollary to Lemma 3.

COROLLARY 1. If w is chosen uniformly at random in GF(2)[x]/(x^L + 1), then P(q(x)w = y mod n − 1 bits) = 1/2^{L−n+1}, where the n − 1 bits are consecutive (modulo L).

THEOREM 5.1. All consecutive L − n + 1 bits (modulo L) of the L-bit CYCLIC n-gram hash family are pairwise independent.

PROOF. We show P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y mod n − 1 bits) = 1/2^{L−n+1} for any polynomials q_i where at least one is different from zero. It is true when there is a single non-zero polynomial q_i(x), by Corollary 1.
Suppose it is true up to k − 1 non-zero polynomials and consider a case where we have k non-zero polynomials. Assuming without loss of generality that q_1(x) ≠ 0, we have P(q_1(x) h_1(a_1) + q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y mod n − 1 bits) = P(q_1(x) h_1(a_1) = y − q_2(x) h_1(a_2) − ... − q_n(x) h_1(a_n) mod n − 1 bits) = ∑_{y′} P(q_1(x) h_1(a_1) = y − y′ mod n − 1 bits) P(q_2(x) h_1(a_2) + ... + q_n(x) h_1(a_n) = y′ mod n − 1 bits) = ∑_{y′} (1/2^{L−n+1})(1/2^{L−n+1}) = 1/2^{L−n+1} by the induction argument, where the sum is over the 2^{L−n+1} values of y′. Hence the uniformity result is shown.

Consider two distinct sequences a_1, a_2, ..., a_n and a′_1, a′_2, ..., a′_n. Write H_a = h(a_1, a_2, ..., a_n) and H_{a′} = h(a′_1, a′_2, ..., a′_n). To prove pairwise independence, it suffices to show that P(H_a = y mod n − 1 bits | H_{a′} = y′ mod n − 1 bits) = 1/2^{L−n+1}. Suppose that a_i = a′_j for some i, j; if not, the result follows by the (full) independence of the hash function h_1. Using Lemma 3, find q(x) such that q(x) ∑_{k | a′_k = a′_j} x^{n−k} =
−∑_{k | a_k = a_i} x^{n−k} mod n − 1 bits; then H_a + q(x) H_{a′} mod n − 1 bits is independent of a_i = a′_j (and of h_1(a_i) = h_1(a′_j)). The hashed values h_1(a_k) for a_k ≠ a_i and h_1(a′_k) for a′_k ≠ a′_j are now relabelled h_1(b_1), ..., h_1(b_m). Write H_a + q(x) H_{a′} = ∑_k q_k(x) h_1(b_k) mod n − 1 bits, where the q_k(x) are polynomials in GF(2)[x]/(x^L + 1) (not all q_k(x) are zero). As in the proof of Lemma 2, we have that H_{a′} = y′ mod n − 1 bits and H_a + q(x) H_{a′} = y + q(x) y′ mod n − 1 bits are independent (here we use the shorthand notation P(f(x, y) = c | x, y) = b to mean P(f(x, y) = c | x = z_1, y = z_2) = b for all values of z_1, z_2): P(H_{a′} = y′ mod n − 1 bits | y′, b_1, b_2, ..., b_m) = 1/2^{L−n+1} by Corollary 1, since H_{a′} = y′ can be written as r(x) h_1(a′_j) = y′ − ∑_k r_k(x) h_1(b_k) for some polynomials r(x), r_1(x), ..., r_m(x). Hence, we have

P(H_a = y mod n − 1 bits | H_{a′} = y′ mod n − 1 bits) = P(H_a + q(x) H_{a′} = y + q(x) y′ mod n − 1 bits | H_{a′} = y′ mod n − 1 bits) = P(H_a + q(x) H_{a′} = y + q(x) y′ mod n − 1 bits) = P(∑_k q_k(x) h_1(b_k) = y + q(x) y′ mod n − 1 bits),
and by the earlier uniformity result, this last probability is equal to 1/2^{L−n+1}.

However, Theorem 5.1 may be pessimistic: the hash values of some n-grams may be more uniform and independent. The next lemma and proposition show that as long as L and n are coprime, it suffices to drop or ignore one bit of the CYCLIC hash so that at least one n-gram (a^n) is hashed uniformly. Recall that L and n are coprime if gcd(L, n) = 1; i.e., their greatest common divisor is 1.

LEMMA 4. If n and L are coprime, then (x^n + 1)w = 0 has only the following solutions in GF(2)[x]/(x^L + 1): w = 0 and w = x^{L−1} + ... + x^2 + x + 1.

PROOF. Clearly, w = 0 and w = x^{L−1} + ... + x^2 + x + 1 are always solutions. The proof that these are the only solutions proceeds in three stages: first, we show that if w solves (x^n + 1)w = 0 mod x^L + 1, then w also solves (x^{kn} + 1)w = 0 mod x^L + 1, for all k. Second, we show that this implies (x^i + 1)w = 0 mod x^L + 1, for all i. Finally, we show that this implies all coefficients in w are equal, a property of only w = 0 and w = x^{L−1} + ... + x^2 + x + 1.
For the first stage, we want to prove that (x^n + 1)w = 0 mod x^L + 1 implies (x^{kn} + 1)w = 0 mod x^L + 1 for any k = 1, 2, ..., by induction. To begin, we have that (x^n + 1)^2 = 1 + x^{2n}. Suppose we have (x^{kn} + 1)w = 0 mod x^L + 1 for k = 1, 2, ..., K − 1; we need (x^{Kn} + 1)w = 0 mod x^L + 1. We have (x^{(K−1)n} + 1)w = 0 mod x^L + 1 ⇒ (x^n + 1)(x^{(K−1)n} + 1)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w + (x^{(K−1)n} + x^n)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w + x^n (x^{(K−2)n} + 1)w = 0 mod x^L + 1 ⇒ (x^{Kn} + 1)w = 0 mod x^L + 1 by the induction argument, showing the result.

For the next stage, note that (x^{kn} + 1)w mod x^L + 1 = (x^{kn mod L} + 1)w mod x^L + 1. Suppose k_1 n = k_2 n mod L. Then (k_1 − k_2)n = 0 mod L and so L divides k_1 − k_2, because n and L are coprime. Hence, the sequence of L elements 0, n, 2n, 3n, ..., (L − 1)n (modulo L) visits all values in {1, 2, 3, ..., L − 1}. Hence, (x^i + 1)w = 0 mod x^L + 1 for all values of i ∈ {1, 2, 3, ..., L − 1}.

Finally, inspecting (x^i + 1)w = 0 mod x^L + 1 reveals that any two bits of w (the coefficients of any two of the L terms) must sum to zero in GF(2). That is, they must be equal. This proves that 0 and x^{L−1} + ... + x^2 + x + 1 are the only possible solutions.

PROPOSITION 3. Suppose a randomly chosen member of the L-bit CYCLIC hash family is applied to the n-gram a^n. Then

—if L and n are coprime and n is odd, then all L bits are uniformly distributed;
—if L and n are coprime, any L − 1 bits of the L-bit hash are uniformly distributed;
—if gcd(L, n) > 2, we do not have a uniform distribution of any L − 1 bits;
—when gcd(L, n) = 2, we have a uniform distribution of any L − 1 bits only when n/2 is odd.

PROOF. We have that h(a_1 a_2 ... a_n) = x^{n−1} h_1(a_1) + ... + x h_1(a_{n−1}) + h_1(a_n) and so h(a, a, ..., a) = (x^{n−1} + ... + x + 1) h_1(a). Let q(x) = x^{n−1} + ... + x + 1. We begin by showing that uniformity would follow if q(x)w = 0 mod x^L + 1 had a unique solution. There are 2^L possible values of h_1(a) and 2^L possible values of h(a, a, ..., a) = q(x) h_1(a). If q(x)w = z mod x^L + 1 has no solution w for some z, then there must be a y such that q(x)w = y mod x^L + 1 has k > 1 solutions w_1, w_2, ..., w_k. Then q(x)w = 0 mod x^L + 1 has at least k solutions: q(x)·0 = 0, q(x)(w_1 − w_2) = 0 mod x^L + 1, q(x)(w_1 − w_3) = 0 mod x^L + 1, ..., q(x)(w_1 − w_k) = 0 mod x^L + 1. Therefore, if q(x)w = 0 mod x^L + 1 has a unique solution (w = 0), then q(x)w = y mod x^L + 1 has a unique solution for any y and we have uniformity.

Our initial cases have n and L coprime. Since q(x) = x^{n−1} + ... + x + 1, we have (x + 1)q(x) = x^n + 1 mod x^L + 1, and so q(x)w = 0 mod x^L + 1 ⇒ (x + 1)q(x)w = 0 mod x^L + 1 ⇒ (x^n + 1)w = 0 mod x^L + 1. Therefore, by Lemma 4:

—when n is odd, w = 0 is the only solution to q(x)w = 0 mod x^L + 1, and thus we have uniformity of all L bits, including any L − 1 bits;
—when n is even, w = 0 and w = x^{L−1} + ... + x + 1 are the only solutions to q(x)w = 0 mod x^L + 1. We wish to show that the equation q(x)w = y mod x^L + 1, modulo the rth bit, has exactly two solutions in GF(2)[x]/(x^L + 1) for any y. This would then imply that the probability that any L − 1 bits of h(a, a, ..., a) take some value is 2/2^L = 1/2^{L−1}, thus proving uniformity of any L − 1 bits. Consider q(x)w = 0 mod x^L + 1 modulo the rth bit: if q(x)w ≠ 0 mod x^L + 1, then q(x)w = x^r mod x^L + 1. In such a case, q(x) has an inverse (w x^{−r}), and q(x)w = 0 mod x^L + 1 has only one solution, a contradiction. Therefore, q(x)w = 0 mod x^L + 1
modulo the rth bit implies q(x)w = 0 mod x^L + 1, which implies that w = 0 or w = x^{L−1} + ... + x + 1. Therefore, if q(x)w_1 = q(x)w_2 mod x^L + 1 modulo the rth bit, then w_1 and w_2 differ by either 0 or x^{L−1} + ... + x + 1. Thus, for any y there are at most two solutions for q(x)w = y mod x^L + 1 modulo the rth bit. Further, if w is a solution to q(x)w = y mod x^L + 1 modulo the rth bit, then so is w + (x^{L−1} + ... + x + 1): solutions come in pairs. There are 2^{L−1} solution pairs and at most one solution pair for each of the 2^{L−1} values y ∈ GF(2)[x]/(x^L + 1) modulo the rth bit. A pigeonhole argument shows exactly two solutions per y.

Next, our second cases have L and n not coprime; let k = gcd(L, n). Since k divides n, we have x^{n−1} + ... + x + 1 = (x^{k−1} + ... + x + 1)(x^{n−k} + ... + x^{2k} + x^k + 1), which can be verified by inspection. Since k divides L, we similarly have x^{L−1} + ... + x + 1 = (x^{k−1} + ... + x + 1)(x^{L−k} + ... + x^{2k} + x^k + 1). Since (x^{k−1} + ... + x + 1)(x + 1)(x^{L−k} + ... + x^{2k} + x^k + 1) = (x + 1)(x^{L−1} + ... + x + 1) = 0 mod x^L + 1, w = (x + 1)(x^{L−k} + ... + x^{2k} + x^k + 1) is a solution of q(x)w = 0 mod x^L + 1. For k > 2, w = (x + 1)(x^{L−k} + ... + x^{2k} + x^k + 1) is distinct from 0 and x^{L−1} + ... + x + 1. Thus, as long as k is larger than two, we have a solution w distinct from 0 and x^{L−1} + ... + x + 1, preventing any L − 1 bits from being uniform. Indeed, the probability that any L − 1 bits of h(a, a, ..., a) will be zero is at least 3/2^L > 1/2^{L−1}.

For k = 2, let w = ∑_{i=0,...,L−1} b_i x^i be a solution of q(x)w = 0 mod x^L + 1. We have x^{n−1} + ... + x + 1 = (x + 1)(x^{n−2} + ... + x^4 + x^2 + 1). By Lemma 4, we have (x^{n−1} + ... + x + 1)w = 0 mod x^L + 1 if and only if (x^{n−2} + ... + x^4 + x^2 + 1)w = 0 or (x^{n−2} + ... + x^4 + x^2 + 1)w = x^{L−1} + ... + x^2 + x + 1. Using these two solutions, we will show that there are exactly two solutions when n/2 is odd and more when n/2 is even.

We split the problem into odd and even terms. Define w_o = ∑_{i=1,3,...,L−1} b_i x^{(i−1)/2} and w_e = ∑_{i=0,2,...,L−2} b_i x^{i/2}. Then (x^{n−2} + ... + x^4 + x^2 + 1)w = 0 mod x^L + 1 implies (x^{n/2−1} + ... + x^2 + x + 1)w_o = 0 mod x^{L/2} + 1 and (x^{n/2−1} + ... + x^2 + x + 1)w_e = 0 mod x^{L/2} + 1, whereas (x^{n−2} + ... + x^4 + x^2 + 1)w = x^{L−1} + ... + x + 1 mod x^L + 1 implies (x^{n/2−1} + ... + x^2 + x + 1)w_o = x^{L/2−1} + ... + x + 1 mod x^{L/2} + 1 and (x^{n/2−1} + ... + x^2 + x + 1)w_e = x^{L/2−1} + ... + x + 1 mod x^{L/2} + 1. Notice that n/2 and L/2 are necessarily coprime, so we can use the results derived above.

—If n/2 is odd, then (x^{n−2} + ... + x^4 + x^2 + 1)w = 0 mod x^L + 1 implies w_o = 0 and w_e = 0, and hence w = 0. Using a symmetric argument, (x^{n/2−1} + ... + x^2 + x + 1)w_o = x^{L/2−1} + ... + x + 1 mod x^{L/2} + 1 and (x^{n/2−1} + ... + x^2 + x + 1)w_e = x^{L/2−1} + ... + x + 1 mod x^{L/2} + 1 imply w_o = w_e = x^{L/2−1} + ... + x^2 + x + 1. Therefore, when n/2 is odd, only w = 0 and w = x^{L−1} + ... + x + 1 are possible solutions.

—When n/2 is even, (x^{n−2} + ... + x^4 + x^2 + 1)w = 0 mod x^L + 1 has two solutions for w_o (w_o = 0 and w_o = x^{L/2−1} + ... + x^2 + x + 1), and likewise two for w_e (w_e = 0 and w_e = x^{L/2−1} + ... + x^2 + x + 1). Hence, q(x)w = 0 mod x^L + 1 has at least 4 solutions: w = 0, w = x^{L−2} + ... + x^4 + x^2 + 1, w = x^{L−1} + ... + x^3 + x, and w = x^{L−1} + ... + x + 1. This concludes the proof.
5.3 Integer-Division hashing (ID37) is not uniform
A variation of the Karp-Rabin hash method is "Hashing by Power-of-2 Integer Division" [Cohen 1997], where h(x_1, ..., x_n) = ∑_i x_i B^{i−1} mod 2^L. The parameter B needs to be
chosen carefully, so that the sequence B^k mod 2^L for k = 1, 2, ... does not repeat quickly. In particular, the hashcode method of the Java String class uses this approach, with L = 32 and B = 31 [Sun Microsystems 2004]. Note that B is much smaller than the number of Unicode characters (about 99,000 > 2^16) [The Unicode Consortium 2006]. A widely used textbook [Weiss 1999, p. 157] recommends a similar Integer-Division hash function for strings, with B = 37. Since such Integer-Division hash functions are recursive, quickly computed, and widely used, it is interesting to seek a randomized version of them. Assume that h_1 is a random hash function over symbols, uniform in [0, 2^L); then define h(x_1, ..., x_n) = h_1(x_1) + B h_1(x_2) + B^2 h_1(x_3) + ... + B^{n−1} h_1(x_n) mod 2^L for some fixed integer B. We choose B = 37 (calling the resulting randomized hash "ID37"; see Algorithm 4).

Algorithm 4 The recursive ID37 family.
Require: an L-bit hash function h_1 over Σ from an independent hash family
1: B ← 37
2: s ← empty FIFO structure
3: x ← 0 (L-bit integer)
4: z ← 0 (L-bit integer)
5: for each character c do
6:   append c to s
7:   x ← x − z mod 2^L
8:   x ← x/B + B^{n−1} h_1(c) mod 2^L
9:   if length(s) = n then
10:    yield x
11:    remove the oldest character y from s
12:    z ← h_1(y)
13:  end if
14: end for
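In Python (our sketch), the division by the odd constant B becomes a multiplication by its inverse modulo 2^L:

def ngram_hashes_id37(stream, n, h1, L=32, B=37):
    # ID37: h(x_1..x_n) = h1(x_1) + B h1(x_2) + ... + B^(n-1) h1(x_n) mod 2^L,
    # rolled by subtracting the outgoing symbol's hash and dividing by B.
    mod = 1 << L
    B_inv = pow(B, -1, mod)    # exists because B is odd (Python 3.8+)
    B_n1 = pow(B, n - 1, mod)

    s, x, z = [], 0, 0
    for c in stream:
        s.append(c)
        x = ((x - z) * B_inv + B_n1 * h1(c)) % mod
        if len(s) == n:
            yield x
            z = h1(s.pop(0))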
Observe that ID37 is recursive over h_1. Moreover, by letting h_1 map symbols over a wide range, we intuitively reduce the undesirable dependence between n-grams sharing a common suffix. However, we fail to achieve uniformity. The independence problems of ID37 are shared by all such randomized Integer-Division hash functions that map n-grams to [0, 2^L); however, they are more severe for certain combinations of B and n.

PROPOSITION 4. Randomized Integer-Division (2^L) hashing with B odd is not uniform for n-grams if n is even. Otherwise, it is uniform, but not pairwise independent.

PROOF. We see that P(h(a^{2k}) = 0) > 2^{−L} since h(a^{2k}) = h_1(a)(B^0(1 + B) + B^2(1 + B) + ... + B^{2k−2}(1 + B)) mod 2^L and, since (1 + B) is even, we have P(h(a^{2k}) = 0) ≥ P(h_1(a) = 2^{L−1} ∨ h_1(a) = 0) = 1/2^{L−1}.

For the rest of the result, we begin with n = 2 and B even. If x_1 ≠ x_2, then P(h(x_1, x_2) = y) = P(B h_1(x_1) + h_1(x_2) = y mod 2^L) = ∑_z P(h_1(x_2) = y − Bz mod 2^L) P(h_1(x_1) = z) = ∑_z P(h_1(x_2) = y − Bz mod 2^L)/2^L = 1/2^L, whereas P(h(x_1, x_1) = y) = P((B + 1) h_1(x_1) = y mod 2^L) = 1/2^L since (B + 1)x = y mod 2^L has a unique solution x when B is even.
Therefore, h is uniform. This argument can be extended to any value of n when B is even, and to odd n when B is odd.

To show h is not pairwise independent, first suppose that B is odd. For any string β of length n − 2, consider the n-grams w_1 = βaa and w_2 = βbb for distinct a, b ∈ Σ. Then P(h(w_1) = h(w_2)) = P(B^2 h(β) + B h_1(a) + h_1(a) = B^2 h(β) + B h_1(b) + h_1(b) mod 2^L) = P((1 + B)(h_1(a) − h_1(b)) = 0 mod 2^L) ≥ P(h_1(a) − h_1(b) = 0) + P(h_1(a) − h_1(b) = 2^{L−1}) = 1/2^{L−1}, whereas pairwise independence would imply P(h(w_1) = h(w_2)) = 1/2^L. Second, if B is even, a similar argument shows P(h(w_3) = h(w_4)) ≥ 1/2^{L−1}, where w_3 = βaa and w_4 = βba: P(h(a, a) = h(b, a)) = P(B h_1(a) + h_1(a) = B h_1(b) + h_1(a) mod 2^L) = P(B(h_1(a) − h_1(b)) = 0 mod 2^L) ≥ P(h_1(a) − h_1(b) = 0) + P(h_1(a) − h_1(b) = 2^{L−1}) = 1/2^{L−1}. This argument can be extended to any value of B and n.

These results also hold for any Integer-Division hash where the modulus is an even number, not necessarily a power of 2. Frequently, such hashes compute their result modulo a prime. However, even if this gave uniformity, the GT algorithm implicitly applies a modulo-2^L operation because it ignores higher-order bits. It is easy to observe that if h(x) is uniform over [0, p), with p prime, then h′(x) = h(x) mod 2^L cannot be uniform. Whether the lack of uniformity and pairwise independence is just a theoretical defect can be addressed experimentally.
6. COUNT ESTIMATION BY PROBING
The accuracy of count estimates obtained using the algorithms of Gibbons and Tirthapura or of Bar-Yossef et al. depends heavily on the hash function used and on the buffer memory allocated to the algorithms [Gibbons and Tirthapura 2001; Bar-Yossef et al. 2002]. This section shows that better accuracy bounds follow from a single run of the algorithm if the hash function is drawn from a family of p-wise independent hash functions (p > 2) than if it is drawn from a family of merely pairwise independent functions. Moreover, even for pairwise independent hashing, we significantly improve the previously reported theoretical bounds. For instance, in Corollary 2 the bound on the reliability is improved from 83% to more than 99%. In turn, this implies that less buffer space can achieve a desired quality of estimates.

Another method for improving the estimates of these techniques is to run them multiple times (with a different hash function chosen from the family each time) and take the median of the various estimates. For estimating some quantity µ within a relative error ("precision") of ε, it is enough to have a sequence of random variables X_i for i = 1, ..., q such that P(|X_i − µ| > εµ) < 1/3, where µ = X̄_i. The median of all the X_i will lie outside (µ − εµ, µ + εµ) only if more than half of the X_i do. This, in turn, can be made very unlikely simply by considering many different random variables (q large). Let Y_i be the random variable taking the value 1 when |X_i − µ| > εµ and zero otherwise, and let Y = ∑_{i=1}^{q} Y_i. We have that E(Y) ≤ q/3 and so 3E(Y)/2 ≤ q/2. Then a Chernoff bound says that [Canny 2002]

P(Y > q/2) ≤ P(Y > Ȳ(1 + 1/2)) ≤ (e^{1/2}/(1 + 1/2)^{1+1/2})^{Ȳ} ≤ (e^{1/2}/(1 + 1/2)^{1+1/2})^{q/3} ≤ e^{−q/30}.
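The wrapper itself is trivial; in Python (our sketch, where run_once stands for one run of any single-pass estimator under a freshly drawn hash function):

import statistics

def median_estimate(run_once, q):
    # Run the estimator q times and return the median of the estimates.
    return statistics.median(run_once() for _ in range(q))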
Fig. 1. For reliability 1 − δ = 0.95, we plot the memory usage M versus the accuracy ε for pairwise (p = 2) and 4-wise (p = 4) independent hash functions as per the bound of Proposition 5. Added independence substantially improves memory requirements for a fixed estimation accuracy according to our theoretical bounds.
Choosing q = 30 ln(1/δ), we have P(Y > q/2) ≤ δ, proving that we can make the median of the Xi fall within εµ of µ, 1 − δ of the time, for δ arbitrarily small. On this basis, Bar-Yossef et al. [Bar-Yossef et al. 2002] report that they can estimate a count with relative precision ε and reliability 1 − δ (i.e., the computed count is within εµ of the true answer µ, 1 − δ of the time) using Õ((1/ε² + log m) log(1/δ)) bits of memory and Õ((log m + 1/ε) log(1/δ)) amortized time. Unfortunately, in practice, repeated probing implies rehashing all n-grams 30 ln(1/δ) times, a critical bottleneck. Moreover, in a streaming context, the various runs are made in parallel, and therefore 30 ln(1/δ) different buffers are needed. Whether this is problematic depends on the application and the size of each buffer.
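For contrast, the repeated-probing scheme we wish to avoid can be sketched as follows; this is our own illustration, and the estimateOnce callback is a hypothetical stand-in for any count estimator whose failure probability is below 1/3, each call re-hashing the entire stream with a fresh hash function.

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Median of q = ceil(30 ln(1/delta)) independent estimates.  Each call
    // to estimateOnce is assumed to land within eps*mu of mu with
    // probability at least 2/3.
    double medianEstimate(const std::function<double()> &estimateOnce,
                          double delta) {
        const int q = static_cast<int>(std::ceil(30.0 * std::log(1.0 / delta)));
        std::vector<double> estimates;
        for (int i = 0; i < q; ++i)
            estimates.push_back(estimateOnce());     // one full pass each time
        std::nth_element(estimates.begin(),
                         estimates.begin() + q / 2, estimates.end());
        return estimates[q / 2];                     // the median estimate
    }

With δ = 0.05, this already means q = ⌈30 ln 20⌉ = 90 passes over the data (or 90 parallel buffers), which is exactly the cost a single, sufficiently independent hash function avoids.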
We are able to generalize previous results [Gibbons and Tirthapura 2001; Bar-Yossef et al. 2002] from pairwise independence to p-wise independence by using the following generalization of Chebyshev's inequality, due to Schmidt et al. [Schmidt et al. 1993, Theorem 2.4, Equation III].

THEOREM 6.1. Let X1, ..., Xm be a sequence of p-wise independent random variables that satisfy |Xi − E(Xi)| ≤ 1. Let X = ∑_i Xi. Then, for C = max(p, σ²(X)), we have

    P(|X − E(X)| ≥ T) ≤ (pC / (e^{2/3} T²))^{p/2}.

In particular, when p = 2, we have

    P(|X − E(X)| ≥ T) ≤ 2C / (e^{2/3} T²).

The next proposition shows that, in order to reduce the memory usage drastically while preserving the same theoretical bounds on the accuracy and reliability, we can increase the independence of the hash functions.
In particular, our theory says that we can estimate the count within 10%, 19 times out of 20, by storing respectively 10.5, 2.5, and 2 thousand n-grams (in some dictionary, for instance in a hash table) depending on whether we have pairwise independent, 4-wise independent, or 8-wise independent hash values. Hence, there is no need to hash the n-grams more than once if we can assume that hash values are ≈ 4-wise independent in practice (see Fig. 1). Naturally, recursive hashing cannot be more than pairwise independent, but we always have the option of using slower non-recursive hashing. The following proposition is stated for n-grams, but applies to arbitrary items.

PROPOSITION 5. Hashing each n-gram only once, we can estimate the number of distinct n-grams within relative precision ε, with a p-wise independent hash for p ≥ 2, by storing M distinct n-grams (M ≥ 8p) and with reliability 1 − δ, where δ is given by

    δ = p^{p/2} / (e^{p/3} M^{p/2}) · (2^{p/2} + 8^{p/2} / (ε^p (2^{p/2} − 1))).

More generally, we have

    δ ≤ p^{p/2} / (e^{p/3} M^{p/2}) · (α^{p/2} / (1 − α)^p + 4^{p/2} / (α^{p/2} ε^p (2^{p/2} − 1)))
for 4p/M ≤ α < 1 and any p, M.

PROOF. We generalize a proof by Bar-Yossef et al. [Bar-Yossef et al. 2002]. Let Xt be the number of distinct elements having only zeros in the first t bits of their hash value. Then X0 is the number of distinct elements (X0 = m). For j = 1, ..., m, let Xt,j be the binary random variable with value 1 if the jth distinct element has only zeros in the first t bits of its hash value, and 0 otherwise. We have Xt = ∑_{j=1}^{m} Xt,j, so E(Xt) = ∑_{j=1}^{m} E(Xt,j). Since the hash function is uniform, P(Xt,j = 1) = 1/2^t, so E(Xt,j) = 1/2^t and hence E(Xt) = m/2^t; that is, 2^t E(Xt) = m. Therefore Xt can be used to estimate m. Using pairwise independence, we can show that σ²(Xt) ≤ m/2^t, because

    σ²(Xt) = E((Xt − m/2^t)²) = σ²(∑_{j=1}^{m} Xt,j)
           = ∑_{j=1}^{m} σ²(Xt,j)    (by 2-wise independence)
           = ∑_{j=1}^{m} [(1/2^t)(1 − 1/2^t)² + (1 − 1/2^t)(0 − 1/2^t)²]
           = ∑_{j=1}^{m} (1/2^t)(1 − 1/2^t) ≤ m/2^t.
By Theorem 6.1, we have

    P(|Xt − m/2^t| ≥ εm/2^t) ≤ p^{p/2} 2^{tp/2} / (m^{p/2} ε^p e^{p/3}),

as long as σ²(Xt) ≥ p. Let M be the available memory and suppose that the hash function has L bits such that we know, a priori, that P(XL > M) ≈ 0. This is necessary since we do not wish to increase L dynamically. It is reasonable since, for L and m fixed, P(XL > M) goes down as 1/M^p:

    P(XL > M) = P(XL − m/2^L > M − m/2^L) ≤ P(|XL − m/2^L| > M − m/2^L)
              ≤ (p max(m/2^L, p) / (e^{2/3} (M − m/2^L)²))^{p/2}

for M − m/2^L > 0, where we used Theorem 6.1. For example, if p = 2, M = 256, and L = 19, then P(XL > M) ≤ 4.4 × 10^{−8} for m = 100,000,000.

The algorithm returns the value 2^{t′} X_{t′}, where t′ is such that X_{t′} ≤ M and X_{t′−1} > M. Note that t′ is itself a random quantity that depends deterministically on the hash function and the input (the same factors that determine Xt). We can bound the error of this estimate as follows:
    P(|2^{t′} X_{t′} − m| ≥ εm) = ∑_{t=0}^{L} P(|Xt − m/2^t| ≥ εm/2^t) P(t′ = t)
                                = ∑_{t=0}^{L} P(|Xt − m/2^t| ≥ εm/2^t) P(X_{t−1} > M, Xt ≤ M).

Splitting the summation into two parts, we get

    P(|2^{t′} X_{t′} − m| ≥ εm)
      ≤ ∑_{t=0}^{t̄−1} P(|Xt − m/2^t| ≥ εm/2^t) + ∑_{t=t̄}^{L} P(X_{t−1} > M, Xt ≤ M)
      = P(X_{t̄−1} > M) + ∑_{t=0}^{t̄−1} P(|Xt − m/2^t| ≥ εm/2^t)
      ≤ P(X_{t̄−1} − m/2^{t̄−1} > M − m/2^{t̄−1}) + ∑_{t=0}^{t̄−1} p^{p/2} 2^{tp/2} / (m^{p/2} ε^p e^{p/3})
      ≤ P(|X_{t̄−1} − m/2^{t̄−1}| > M − m/2^{t̄−1}) + ∑_{t=0}^{t̄−1} p^{p/2} 2^{tp/2} / (m^{p/2} ε^p e^{p/3})
      ≤ (pm / (2^{t̄−1} e^{2/3} (M − m/2^{t̄−1})²))^{p/2} + p^{p/2} (2^{t̄p/2} − 1) / (m^{p/2} ε^p e^{p/3} (2^{p/2} − 1)),

where we assumed that p ≤ m/2^{t̄−1}. Choose t̄ ∈ {0, ..., L} such that αM/4 < m/2^{t̄} ≤ αM/2 for some α < 1 satisfying αM/4 ≥ p; then

    (pm / (2^{t̄−1} e^{2/3} (M − m/2^{t̄−1})²))^{p/2} = p^{p/2} m^{p/2} / (2^{(t̄−1)p/2} e^{p/3} (M − m/2^{t̄−1})^p)
                                                    ≤ p^{p/2} α^{p/2} / (e^{p/3} (1 − α)^p M^{p/2}),

whereas

    p^{p/2} (2^{t̄p/2} − 1) / (m^{p/2} ε^p e^{p/3} (2^{p/2} − 1)) ≤ p^{p/2} 4^{p/2} / (ε^p α^{p/2} M^{p/2} e^{p/3} (2^{p/2} − 1)).

Hence, we have

    P(|2^{t′} X_{t′} − m| ≥ εm) ≤ p^{p/2} / (e^{p/3} M^{p/2}) · (α^{p/2} / (1 − α)^p + 4^{p/2} / (α^{p/2} ε^p (2^{p/2} − 1))).

Setting α = 1/2, we have

    P(|2^{t′} X_{t′} − m| ≥ εm) ≤ p^{p/2} 2^{p/2} / (e^{p/3} M^{p/2}) + p^{p/2} 8^{p/2} / (ε^p M^{p/2} e^{p/3} (2^{p/2} − 1)).
For different values of p and ε, other values of α can give tighter bounds: the best value of α can be estimated numerically. This completes the proof.

Again, the following corollary applies not only to n-grams but also to arbitrary items. It follows from setting M = 576/ε², p = 2, and α = 1 − ε.

COROLLARY 2. With M = 576/ε² and p = 2, we can estimate the count of distinct n-grams with reliability (1 − δ) of 99% for any ε > 0.

PROOF. From Proposition 5, consider

    δ ≤ p^{p/2} / (e^{p/3} M^{p/2}) · (α^{p/2} / (1 − α)^p + 4^{p/2} / (α^{p/2} ε^p (2^{p/2} − 1)))

for α = 1 − ε, M = 576/ε², and p = 2; then

    δ ≤ (2ε² / (e^{2/3} 576)) ((1 − ε)/ε² + 4/((1 − ε)ε²)) = (2 / (e^{2/3} 576)) (1 − ε + 4/(1 − ε)).

Taking the limit as ε tends to 0, on the basis that the probability of a miss (P(|Xt − m/2^t| ≥ εm/2^t)) can only grow as ε diminishes, we find that δ is bounded by 10/(e^{2/3} 576) ≈ 0.009.

This significantly improves the value 5/6 for 1 − δ given by Bar-Yossef et al. [Bar-Yossef et al. 2002] for the same value of M: the bound on the rate of failure goes from 1/6 to less than 1%. Because p-wise independence implies (p − 1)-wise independence, memory usage, accuracy, and reliability can only improve as p increases. For large p (p ≫ 4), the result of Proposition 5 is no longer useful because of the p^{p/2} factor; other inequalities on P(|X − E(X)| ≥ T) for p-wise independent X should then be used.
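As a sanity check of Proposition 5 (our own computation, not taken from the original text), the pairwise entry of the figures quoted earlier can be reproduced by optimizing α numerically:

    With p = 2 the general bound reads δ ≤ 2/(e^{2/3} M) · (α/(1 − α)² + 4/(α ε²)).
    For ε = 0.1, minimizing f(α) = α/(1 − α)² + 400/α gives α ≈ 0.85 and f(α) ≈ 508,
    so δ ≤ 2 × 508/(e^{2/3} M) ≈ 522/M.
    Requiring δ = 0.05 yields M ≈ 10,440 ≈ 10.5 × 10³ stored n-grams,
    matching the pairwise figure above; the same computation with p = 4 gives M ≈ 2,500.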
Table II. Summary of the 4 n-gram hash families implemented. Recall that we chose p(x) = x^19 + x^18 + x^17 + x^16 + x^12 + x^7 + x^6 + x^5 + x^3 + x^2 + 1 (GENERAL).

    hash family          definition h(a1, a2, ..., an)                           recursive   independence
    n-wise independent   h1(a1) ⊕ h2(a2) ⊕ ··· ⊕ hn(an)                          no          n-wise ind.
    CYCLIC               x^{n−1} h1(a1) + ··· + h1(an) in GF(2)[x]/(x^19 + 1)    yes         quasi-pairwise ind.
    GENERAL              x^{n−1} h1(a1) + ··· + h1(an) in GF(2)[x]/p(x)          yes         pairwise ind.
    ID37                 h1(a1) + ··· + B^{n−1} h1(an) mod 2^19 for B = 37       yes         not even uniform
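To make the GENERAL family concrete, here is a minimal C++ sketch of recursive hashing in GF(2)[x]/p(x) with the 19-bit polynomial above. The structure (shift, conditional XOR with the mask 0xF10ED encoding p(x), and a precomputed table for expiring symbols) is our own rendering of the standard shift-and-reduce construction, not the paper's exact implementation.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Hash values are polynomials over GF(2) of degree < 19; arithmetic is
    // done modulo p(x) = x^19+x^18+x^17+x^16+x^12+x^7+x^6+x^5+x^3+x^2+1.
    class GeneralHash {
        static const uint32_t P = 0xF10ED;       // p(x), including the x^19 bit
        std::vector<uint32_t> h1, h1out;          // h1 and x^{n-1} h1 mod p(x)
        static uint32_t timesX(uint32_t v) {      // multiply by x, reduce mod p(x)
            v <<= 1;
            return (v & (1u << 19)) ? (v ^ P) : v;
        }
    public:
        GeneralHash(std::size_t alphabetSize, std::size_t n, std::mt19937 &gen) {
            std::uniform_int_distribution<uint32_t> d(0, (1u << 19) - 1);
            for (std::size_t c = 0; c < alphabetSize; ++c) {
                uint32_t v = d(gen);
                h1.push_back(v);
                for (std::size_t i = 0; i + 1 < n; ++i) v = timesX(v);
                h1out.push_back(v);               // x^{n-1} h1(c) mod p(x)
            }
        }
        // Slide the window: drop 'out', append 'in'; in GF(2), XOR is both
        // addition and subtraction.
        uint32_t update(uint32_t h, unsigned char out, unsigned char in) const {
            return timesX(h ^ h1out[out]) ^ h1[in];
        }
    };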
7. EXPERIMENTAL RESULTS
Experiments assess the accuracy of estimates obtained with several hash functions (n-wise independent, CYCLIC, GENERAL, and ID37; see Table II) on various input streams, using the GT count-estimation algorithm. All hash values are in [0, 2^19). Since we are mostly interested in comparing n-gram hash families, we chose not to compare variants of GT [Bar-Yossef et al. 2002] or other possible count-estimation algorithms such as probabilistic counting [Flajolet and Martin 1985]. For an experimental comparison of several such count-estimation algorithms in the context of multidimensional hashing, see Aouiche and Lemire [2007]. The experiments demonstrate that the techniques can be implemented efficiently. Details are available in a technical report [Lemire and Kaser 2006], together with experimental results on entropy and frequent-item-count estimation. Our code is written in C++ and is available upon request.
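For reference, here is a condensed sketch of the GT-style estimator analyzed in Section 6: keep the hashed values whose t low-order bits are all zero, incrementing t whenever the buffer exceeds M, and return 2^t times the final buffer size. This is our own simplification (function name included) and omits refinements of the published algorithms.

    #include <cstdint>
    #include <unordered_set>

    // One-pass GT-style distinct-count estimator.  The iterators range over
    // the stream of L-bit n-gram hash values; M is the buffer capacity.
    template <typename Iterator>
    double gtEstimate(Iterator begin, Iterator end, std::size_t M) {
        std::unordered_set<uint32_t> buffer;
        unsigned t = 0;
        for (Iterator it = begin; it != end; ++it) {
            uint32_t h = *it;
            if ((h & ((1u << t) - 1)) != 0) continue; // low t bits not all zero
            buffer.insert(h);
            while (buffer.size() > M) {               // tighten the filter
                ++t;
                for (auto b = buffer.begin(); b != buffer.end(); )
                    b = ((*b & ((1u << t) - 1)) != 0) ? buffer.erase(b) : ++b;
            }
        }
        return static_cast<double>(buffer.size()) * (1u << t);
    }

The returned value is exactly the 2^{t′} X_{t′} analyzed in the proof of Proposition 5.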
7.1 Test Inputs
One complication is that the theoretical predictions are based on worst-case inputs, which are unknown to us. In fact, the worst-case input may not even correspond to a stream of n-grams (where n − 1 characters from one n-gram reappear in the following n-gram). As well, we want to know how these techniques perform on "typical" inputs. Our experiments therefore used the n-grams from a collection of 11 texts from Project Gutenberg, as well as synthetic data sets generated according to various generalized Zipfian distributions. Since we are analyzing the performance of several randomized algorithms, we ran each algorithm at least 100 times on each input. As a result, we cannot run tests on inputs as large as would be appropriate for corpus-linguistics studies: to complete the entire suite of experiments in reasonable time, we must limit ourselves to texts (for instance, Shakespeare's First Folio) where one run takes at most a few minutes. Even so, the entire suite of experiments takes several days to run on current equipment.
7.2 Accuracy of Estimates
We have theoretical bounds relating the error ε to a given reliability (typically 19/20) when the hash function is taken from a p-wise independent family (see Table III). But how close to this bound do we come when n-grams are drawn from a "typical" input for a computational-linguistics study? And do hash functions from highly independent families actually enable more accurate estimates? (A "data-agnostic" estimate such as that of Sect. 3 is hopelessly inaccurate: it predicts 4.4 million 5-grams for Shakespeare's First Folio, but the actual number is 13 times smaller.)

The 11 texts are eduha10 (The Education of Henry Adams), utrkj10 (Unbeaten Tracks in Japan), utopi10 (Utopia), remus10 (Uncle Remus His Songs and His Sayings), btwoe10 (Barchester Towers), 00ws110 (Shakespeare's First Folio), hcath10 (History of the Catholic Church), rlchn10 (Religions of Ancient China), esymn10 (Essay on Man), hioaj10 (Impeachment of Andrew Johnson), and wflsh10 (The Way of All Flesh).

Table III. Maximum error rates ε, 19 times out of 20, for various amounts of memory M and for p-wise independent hash values, according to Proposition 5.

    M       256      1,024    2,048    65,536   262,144   1,048,576
    p = 2   86.4%    36.8%    24.7%    3.8%     1.8%      0.9%
    p = 4   34.9%    16.1%    11.1%    1.8%     0.9%      0.5%
    p = 8   30.0%    14.1%    9.7%     1.6%     0.8%      0.4%
Fig. 2. Count estimate errors over Shakespeare’s First Folio (00ws110.txt), 100 runs estimating 10-grams with M = 2048.
Figure 2 shows the relative error ε observed for the four hash functions (100 estimations with each). Estimates are ranked by decreasing ε, and we see that ID37 had more of the poorer runs than the others. Figure 3 shows a test input (remus) that was the worst of the 11 for several hash functions when M = 256. ID37 does reasonably well there, but 10-wise independent hashing lags. To study the effect of varying M, we use the 5th-largest error of 100 runs. This 95th-percentile error can be compared to the theoretical bound for ε with 19/20 reliability. Figure 4 plots the largest 95th-percentile error observed over the 11 test inputs. It is apparent that there is no significant accuracy difference between the hash functions. Only the n-wise independent hash is guaranteed to stay beneath the theoretical bound for 10-wise independent hashing given by Proposition 5; however, over the eleven Gutenberg texts, the others were just as accurate in our experiments.
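Concretely, the 95th-percentile error reported throughout is just the 5th-largest relative error among the 100 runs; a small sketch (our own naming):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // 95th-percentile error: the 5th-largest relative error among (at
    // least five) runs; with 100 runs this matches 19-out-of-20 reliability.
    double percentile95Error(const std::vector<double> &estimates, double truth) {
        std::vector<double> errors;
        for (double e : estimates)
            errors.push_back(std::fabs(e - truth) / truth);
        std::sort(errors.begin(), errors.end());
        return errors[errors.size() - 5];
    }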
7.3 Using a Wider Range of Values for M
An important motivation for using p-wise independent hashing is to obtain a reliable estimate while hashing only once, using a small M. Nevertheless, we have thus far not observed notable differences between the hash functions.
Fig. 3. remus errors, 100 runs estimating 10-grams with M = 256.
Fig. 4. For each M and hash function, worst-case 95th-percentile error observed on the 11 test inputs, together with the theoretical pairwise and 10-wise bounds.
Therefore, although we expect typical values of M to be a few hundred to a few thousand, we broaden the range of M examined. Although the theoretical guarantees for tiny M are poor, perhaps typical results will still be usable. And even a single buffer with M = 2^20 is inconsequential when a desktop computer has several gibibytes of RAM, and the construction of a hash table or B-tree with such a value of M is still quite affordable. Moreover, with a wider range of M, we start to see differences between some hash functions.

We choose M = 16, 16², 16³, and 16⁴ and analyze the 5-grams in the text 00ws1 (Shakespeare's First Folio). There are approximately 300,000 5-grams; we selected a larger file because, when M = 16⁴, it seems unhelpful to estimate the number of 5-grams unless the file contains substantially more 5-grams than M.
Fig. 5. 95th-percentile error values (for 5-gram estimates) on 00ws1 for various hash families, over a wide range of M. Our analysis does not permit prediction of error bounds when M = 16.
Figure 5 shows the 95th-percentile errors for Shakespeare's First Folio when 5-grams are estimated. There are some smaller differences at M = 65536 (surprisingly, the 5-wise hash function, with a better theoretical guarantee, seems slightly worse than CYCLIC and GENERAL). However, it is clear that the theoretical deficiencies of ID37 finally have an effect: it is small at M = 4096 but large at M = 65536. (We observed similar problems on Zipfian data.) To be fair, this non-uniform hash still performs better than the pairwise bound, but the trend appears clear. Does it, however, continue for very large M?
7.4 Very Large M
Clearly, it is only sensible to measure performance when M ≪ m. Therefore, we estimate the number of 10-grams obtained when all plain-text files on the Gutenberg CD are concatenated. With M = 2^20, 10-wise independent hashing had an observed 95th-percentile error of 0.182% and GENERAL had 0.218%. The ID37 error was somewhat worse, at 0.286%. (The theoretical pairwise error bound is 0.908% and the 10-wise bound is 0.425%.) Considering the M = 65536 case from Fig. 5, we see no experimental reason to prefer n-wise hashing to GENERAL, but ID37 looks less promising. Moreover, n = 10 with B = 37 is a non-uniform combination for Integer-Division hashing (Proposition 4).
7.5 Caveats with Random-Number Generators
To observe the effect of fully independent hashing, we implemented the usual (slow and memory-intensive) scheme where a random value is assigned and stored whenever a key is first seen. Probabilistic estimation of the number of n-grams is likely to expose deficiencies in the random-number generator, and therefore several generators were tried: the pseudorandom-number generator in the GNU/Linux C library, the Mersenne Twister (MT) [Matsumoto and Nishimura 1998], and the Marsaglia-Zaman-James (MZJ) generator [Marsaglia and Zaman 1987; James 1990; Bourke 1998].
Fig. 6. 95th-percentile error for 5-gram estimates on Zipfian data (s = 1).
We also tried a collection of bytes generated from a random physical process (radio static) [Haahr 1998]. For M = 4096, the 95th-percentile error for text 00ws1 was 4.7% for Linux rand(), 4.3% for MT, and 4.1% for MZJ. These three pseudorandom-number generators were no match for truly random numbers, where the 95th-percentile error was only 2.9%. Comparing this final number to Fig. 5, we see that fully independent hashing is only a modest improvement on Cohen's hash functions (which fare better than 5%), despite its stronger theoretical guarantee. The other hash functions also rely on random-number generation (for h1 in Cohen's hashes and in ID37; for h1, ..., hn in the n-wise independent hash). It would be problematic if their performance were heavily affected by the precise random-number generation process. However, when we examined the 95th-percentile errors, we did not observe any appreciable differences from varying the pseudorandom-number generation process or from using truly random numbers.
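The fully independent scheme can be sketched as a memo table (the class and its names are ours): assign a fresh random value the first time a key is seen.

    #include <cstdint>
    #include <random>
    #include <string>
    #include <unordered_map>

    // Fully independent hashing by memoization: each distinct n-gram gets
    // a fresh random 19-bit value on first sight.  Unlike recursive
    // hashing, the memo table grows with the number of distinct n-grams,
    // which is exactly the storage cost we are otherwise trying to avoid.
    class MemoizedHash {
        std::unordered_map<std::string, uint32_t> memo;
        std::mt19937 gen;
        std::uniform_int_distribution<uint32_t> dist{0, (1u << 19) - 1};
    public:
        explicit MemoizedHash(uint32_t seed) : gen(seed) {}
        uint32_t operator()(const std::string &ngram) {
            auto it = memo.find(ngram);
            if (it == memo.end())
                it = memo.emplace(ngram, dist(gen)).first;
            return it->second;
        }
    };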
7.6 95th-Percentile Errors using Zipfian Data

The various hash functions were also tested on synthetic Zipfian data, where the probability of the kth symbol is proportional to k^{−s}. (We chose s ∈ [0.8, 2.0].) Each data set had N ≈ 10^5, but for larger values of s there were significantly fewer distinct n-grams; therefore, measurements for M = 65536 would not be meaningful and are omitted. Some results, for n = 5, are shown in Figs. 6–8. The ID37 method is noteworthy: its performance for larger M is badly affected by increasing s. The other hash functions are nearly indistinguishable in almost all other cases. Results for 10-grams (not shown) are similar, except that s can grow slightly larger before ID37 fails badly.
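Such data can be generated as in the following sketch (our own; it inverts the cumulative distribution with a binary search):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Draw symbols 0..alphabetSize-1 with P(k) proportional to (k+1)^{-s}.
    class ZipfGenerator {
        std::vector<double> cdf;                  // cumulative probabilities
        std::mt19937 gen;
        std::uniform_real_distribution<double> unif{0.0, 1.0};
    public:
        ZipfGenerator(std::size_t alphabetSize, double s, uint32_t seed)
            : gen(seed) {
            double total = 0;
            for (std::size_t k = 0; k < alphabetSize; ++k) {
                total += std::pow(k + 1.0, -s);
                cdf.push_back(total);
            }
            for (double &c : cdf) c /= total;     // normalize: cdf.back() == 1
        }
        std::size_t next() {                      // invert the CDF
            return std::lower_bound(cdf.begin(), cdf.end(), unif(gen))
                   - cdf.begin();
        }
    };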
Fig. 7. 95th-percentile error for 5-gram estimates on Zipfian data (s = 1.6).
Fig. 8. 95th-percentile error for 5-gram estimates on Zipfian data (s = 2).

7.7 Choosing between Cohen's Hash Functions
Recall that we proved that Cohen's general polynomial hash function is pairwise independent and that no hash function that is recursive over hashed values can be 3-wise independent, or even 3-wise trailing-zero independent. In that sense, GENERAL is optimal. Yet Cohen's cyclic polynomial hash function can be implemented slightly more efficiently and, under some conditions (see Section 5.2), it is also pairwise independent. The practical implication of these results is not clear; indeed, the experiments reported so far do not reveal a significant distinction between CYCLIC and GENERAL, at least at the 95th-percentile error.
There are several possible explanations for these experimental results. First, we know that the GT algorithm relies on the trailing-zeros properties of the hashes. Perhaps CYCLIC is pairwise trailing-zero independent in some cases where it is not pairwise independent. Alternatively, though we know that neither hash is 3-wise independent, either of them might be "nearly 3-wise independent" in some sense (e.g., except when three n-grams have some unusual properties). Finally, our theory considers a worst-case sequence of n-grams, whereas experiments more easily assess average-case inputs: the theoretical bad cases may be extremely rare.

Recall that theory only guarantees pairwise independence for CYCLIC after we discard n − 1 of the L bits. While we can simply generate more bits than we need, that may be slow. Perhaps, though, the theoretical penalties from using all bits are not observed in practice.

Computationally, we can discover the worst-case input of length N and the exact distribution of errors. We enumerated all possible hash functions h1 and all possible inputs of length N. There are 2^{L|Σ|} choices for h1 and |Σ|^N inputs, where |Σ| is the size of the alphabet. Clearly, these exhaustive tests are only feasible for small values of |Σ|, L, and N. We tried to count trigrams over a binary input of length 10 using 7 bits for the hashed values (|Σ| = 2, L = 7, M = 2, N = 10, n = 3). Each of the 2^{14} choices for h1 leads to an error, and as usual we are interested in the error at the 95th percentile. Note that there cannot be more than 8 distinct trigrams (m ≤ N − n + 1), and thus the ratio of the distinct count over the memory budget (m/M) is small enough that we should rarely use more than the 7 − 3 + 1 = 5 bits guaranteed pairwise independent by Theorem 5.1. Based on this computation, on every input, CYCLIC and GENERAL had exactly the same 95th-percentile error. This led to the conjecture that there is no practical difference between Cohen's two approaches, even when we use more than L − n + 1 bits, and some initial experiments seemed to back this up. However, the conjecture ultimately proved false. Although experiments revealed many cases where CYCLIC behaved well even when more than L − n + 1 bits were used, we also found cases where violating this limit made CYCLIC behave poorly relative to GENERAL.

Figure 9 shows the errors when 13-grams were estimated using CYCLIC and either L = 7 or L = 19. The data sets were five binary sequences, each with an (independent) 70% chance of generating a 0 as the next symbol. These resemble binary Zipfian sequences, and each sequence's length was chosen to obtain 2000 distinct 13-grams. The similar behavior observed on all five sequences suggests that we are probably observing average-case behavior. On these sequences, GT never used more than 4 bits from the GENERAL hash. Yet when L = 7 we observed five instances where CYCLIC ran out of bits, indicating that its estimate was already more than 15 times too big. When L = 19, we have more than enough pairwise-independent bits to estimate the number of distinct 13-grams (recall Theorem 5.1). Because this bad result is observed for L = 7, it rules out the possibility that bad behavior can be prevented if L and n are coprime.

The same process was also used to compare the performance of GENERAL between L = 7 and L = 19.
Note that Table III tells us only that we can expect an error of at most 86%, 95% of the time; this is not a very strong guarantee. Here, we observe 95th-percentile errors of about 25% for CYCLIC. The point is not that we have observed worse-than-pairwise-independent behavior (we have not); it is that, when forced to use too many bits, CYCLIC can become clearly worse than GENERAL.
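The exhaustive test can be organized as in the following sketch (our own; estimateError is a placeholder standing in for a run of the GT estimator with a given assignment of h1, returning the relative error):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Exhaustive comparison for |Sigma| = 2, L = 7, N = 10, n = 3:
    // enumerate every h1 (a pair of 7-bit values, 2^14 choices) and every
    // binary input (2^10 choices), and report the worst, over all inputs,
    // of the 95th-percentile error over the choices of h1.
    double worstCase95thPercentile(double (*estimateError)(uint32_t h1a,
                                                           uint32_t h1b,
                                                           uint32_t input)) {
        double worst = 0;
        for (uint32_t input = 0; input < (1u << 10); ++input) {
            std::vector<double> errors;
            for (uint32_t h1a = 0; h1a < (1u << 7); ++h1a)
                for (uint32_t h1b = 0; h1b < (1u << 7); ++h1b)
                    errors.push_back(estimateError(h1a, h1b, input));
            std::sort(errors.begin(), errors.end());
            double p95 = errors[(errors.size() * 95) / 100];
            if (p95 > worst) worst = p95;
        }
        return worst;
    }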
Fig. 9. Count estimation errors using GT with CYCLIC hashing on 5 random binary sequences, each containing m = 2000 distinct 13-grams; buffer size M was 256. We generated 100 estimates for each binary sequence, and the relative errors of these estimates are displayed against their rank. Results are plotted for L = 7 and L = 19. Note that there are 12 points (all for L = 7) off the scale: one at ε = 0.32, six at ε = 1.0, and five at ε = 15.5.
Table IV. Count estimation error (%) using GT with polynomial hashes CYCLIC and GENERAL, Zipfian data set, L = 19, n = 5.

                 Cyclic Polynomial       General Polynomial
    percentile   M=64      M=1024        M=64      M=1024
    25           3.60      0.931         3.60      0.931
    50           7.06      1.98          8.49      2.01
    75           14.0      3.41          13.7      3.60
    95           30.9      6.11          31.2      6.22
    mean         10.6      2.45          10.7      2.51
They differed little, both being nearly the same as obtained with CYCLIC at L = 19. (Actually, CYCLIC with L = 19 was marginally better than GENERAL!) Finally, we discuss results on realistic non-binary Zipfian data. For each run, we made a single random choice of h1. Results, for more than 2000 runs, are shown in Table IV. The table shows ε values (in percent); it shows a slightly better performance from CYCLIC, and means also slightly favour CYCLIC. This is consistent with the experimental results reported by Cohen [Cohen 1997]. We also ran more extensive tests using an English text (00ws1.txt), where there seems to be no notable distinction; 10,000 test runs were made, and results are shown in Table V. Overall, we concede that there might be a small accuracy improvement from using CYCLIC, provided that L − n + 1 is somewhat larger than log2(m/M) + 1, the number of bits that we anticipate GT using. However, this effect is never large, and the accuracy penalty from using too many bits can be large. Whether CYCLIC is viable, if used carefully, depends on whether its relative speed advantage survives after n − 1 bits are discarded.
Table V. Count estimation error (%) using GT with polynomial hashes CYCLIC and GENERAL, data set 00ws1, n = 5.

                 CYCLIC                  GENERAL
    percentile   M=64      M=1024        M=64      M=1024
    25           5.79      1.30          5.79      1.30
    50           10.7      2.58          10.7      2.69
    75           18.2      4.55          19.0      4.55
    95           30.6      7.69          30.6      7.69
    mean         12.5      3.15          12.6      3.14

Table VI. Time (seconds) to process all Gutenberg CD files, 10-grams.

    Hashing    M = 2^10    M = 2^20
    10-wise    794         938
    ID37       268         407
    Cyclic     345         486
    General    352         489
7.8 Speed
Speeds were measured on a Dell PowerEdge 6400 multiprocessor server (four Pentium III Xeon 700 MHz processors, each with a 2 MiB cache, sharing 2 GiB of 133 MHz RAM). The OS kernel was Linux 2.4.20, and GNU C++ compiler version 3.2.2 was used with the compiler flags -O2 -march=i686 -fexceptions. The STL map class was used to construct look-up tables. Only one processor was used, and the data set consisted of all the plain-text files on the Project Gutenberg CD, concatenated into a single disk file containing over 400 MiB and approximately 116 million 10-grams. For comparison, this file was too large to process with the Sary suffix-array package [Takabayashi 2005] (version 1.2.0), since the array would have exceeded 2 GiB. However, the first 200 MB was successfully processed by Sary, which took 1886 s to build the suffix array (various command-line options were attempted; we report the fastest time achieved). The SUFARY package [Yamashita 2005] (version 2.3.8) is said to be faster than Sary [Takabayashi 2005]. It processed the 200 MB file in 2640 s and then required 95 s to (exactly) compute the number of 5-grams with more than 100,000 occurrences. Pipelined suffix-array implementations reportedly can process inputs as large as 4 GB in hours [Dementiev et al. 2005].

From Table VI we see that n-gram estimation can be implemented efficiently. First, comparing results for M = 2^20 to those for M = 2^10, we see that using a larger table costs roughly 140 s in every case. This increase is small considering that M was multiplied by 2^10, and it is consistent with the computational cost being dominated by the hashing. Comparing different hashes, the 10-wise independent hash was about twice as slow as a recursive hash. Hashing with ID37 was 15–25% faster than using Cohen's approaches.

Assuming that we are willing to allocate very large files to create suffix arrays and to use much internal memory, an exact count is still at least 10 times more expensive than an approximation. Where the suffix-array approach would take about an hour to compute n-gram counts over the entire Gutenberg CD, an estimate is available in about 6 minutes
while using very little memory and no permanent storage.

8. CONCLUSION
Considering speed, theoretical guarantees, and actual results, we recommend Cohen's GENERAL. It is fast, has a theoretical performance guarantee, and behaves at least as well as either ID37 or the n-wise independent approach. GENERAL is pairwise independent, the best possible for a recursive hash, so it meets the strongest theoretical guarantee available in this class. Unlike with Cohen's CYCLIC, one can safely use all of GENERAL's bits, and the minor speed advantage of CYCLIC does not seem worthwhile. The n-wise independent hashing comes with a stronger theoretical guarantee than either of Cohen's hashes, and thus there can be no unpleasant surprises with its accuracy on any data set; yet there is a significant speed penalty for its use in our implementation. The speed gain of ID37 is worthwhile only for very small values of M: not only does it lack a theoretical accuracy guarantee, but for larger M it falls far behind the other hashing approaches in practice. Except where accuracy is far less important than speed, we cannot recommend ID37.

There are various avenues for follow-up work that we are pursuing. First, further improvements to the theoretical bound seem possible, especially for more than 4-wise independent hashes. Generally, the current theory fails to completely explain our experimental results: why do CYCLIC and GENERAL sometimes fare better than n-wise independent hashing at count estimation? Does increased independence improve the accuracy of the GT count-estimation algorithm? Second, each item being counted can have an occurrence count kept for it; this may enable entropy estimation as well as estimates of the number of distinct frequent n-grams [Lemire and Kaser 2006].

REFERENCES
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In STOC '96. 20–29.
Aouiche, K. and Lemire, D. 2007. Unassuming view-size estimation techniques in OLAP. In ICEIS'07.
Bacon, F. L. and Houde, D. J. 1984. Data compression apparatus and method. US Patent 4612532. Filed 1984; granted 1986. Assignee Telebyte (later Telcor Systems).
Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In RANDOM'02. 1–10.
Batu, T., Dasgupta, S., Kumar, R., and Rubinfeld, R. 2002. The complexity of approximating entropy. In STOC'02. ACM Press, New York, NY, USA, 678–687.
Bernard, M. 1995. À juste titre: A lexicometric approach to the study of titles. Literary and Linguistic Computing 10, 2, 135–141.
Bose, R. P. J. C. and Srinivasan, S. H. 2005. Data mining approaches to software fault diagnosis. In RIDE '05. IEEE Computer Society, Washington, DC, USA, 45–52.
Bourke, P. 1998. Uniform random number generator. Online: http://astronomy.swin.edu.au/~pbourke/other/random/index.html. Checked 2007-05-30.
Canny, J. 2002. CS174 lecture notes. http://www.cs.berkeley.edu/~jfc/cs174/lecs/lec10/lec10.pdf. Checked 2007-05-30.
Caropreso, M. F., Matwin, S., and Sebastiani, F. 2001. Text Databases & Document Management: Theory & Practice. Idea Group Publishing, Hershey, PA, USA, Chapter A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization, 78–102.
Carter, L. and Wegman, M. 1979. Universal classes of hash functions. Journal of Computer and System Sciences 18, 2, 143–154.
Ciaccia, P., Golfarelli, M., and Rizzi, S. 2001. On estimating the cardinality of aggregate views. In DMDW. 12.1–12.10.
Ciaccia, P., Golfarelli, M., and Rizzi, S. 2003. Bounding the cardinality of aggregate views through domain-derived constraints. Data & Knowledge Engineering 45, 2, 131–153.
Codd, E. 1993. Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Tech. rep., E.F. Codd and Associates.
Cohen, J. D. 1997. Recursive hashing functions for n-grams. ACM Trans. Inf. Syst. 15, 3, 291–320.
Cohen, P., Heeringa, B., and Adams, N. 2002. Unsupervised segmentation of categorical time series into episodes. In ICDM'02. 99–106.
Dehne, F., Eavis, T., and Rau-Chaplin, A. 2006. The cgmCUBE project: Optimizing parallel data cube generation for ROLAP. Distributed and Parallel Databases 19, 1, 29–62.
Dementiev, R., Mehnert, J., Kärkkäinen, J., and Sanders, P. 2005. Better external memory suffix array construction. In ALENEX05: Workshop on Algorithm Engineering & Experiments.
Deshpande, M. and Karypis, G. 2004. Selective Markov models for predicting web page accesses. ACM Transactions on Internet Technology (TOIT) 4, 2, 163–184.
Doraisamy, S. and Rüger, S. 2003. Position indexing of adjacent and concurrent n-grams for polyphonic music retrieval. In ISMIR 2003. 227–228.
Droettboom, M. 2003. Correcting broken characters in the recognition of historical printed documents. In Digital Libraries 2003. 364–366.
Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities. In ESA'03. Springer.
Faloutsos, C., Matias, Y., and Silberschatz, A. 1996. Modeling skewed distribution using multifractals and the 80-20 law. In VLDB'96. 307–317.
Flajolet, P. and Martin, G. 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31, 2, 182–209.
Gao, J. and Zhang, M. 2001. Improving language model size reduction using better pruning criteria. In ACL'02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 176–182.
Gibbons, P. B. and Tirthapura, S. 2001. Estimating simple functions on the union of data streams. In SPAA'01. 281–291.
Giegerich, R., Kurtz, S., and Stoye, J. 1999. Efficient implementation of lazy suffix trees. In WAE'99. 30–42.
Gonnet, G. H. and Baeza-Yates, R. A. 1990. An analysis of the Karp-Rabin string matching algorithm. Information Processing Letters 34, 5, 271–274.
Gray, J., Bosworth, A., Layman, A., and Pirahesh, H. 1996. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In ICDE '96. 152–159.
Guha, S., McGregor, A., and Venkatasubramanian, S. 2006. Streaming and sublinear approximation of entropy and information distances. In SODA'06. ACM Press, New York, NY, USA, 733–742.
Haahr, M. 1998. random.org - true random number service. Online: http://www.random.org. Checked 2007-05-30.
Haas, P., Naughton, J., Seshadri, S., and Stokes, L. 1995. Sampling-based estimation of the number of distinct values of an attribute. In VLDB'95. 311–322.
Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In FOCS'03. 251–260.
James, F. 1990. A review of pseudorandom number generators. Computer Physics Communications 60, 329–344.
Jelinek, F. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA.
Joula, P., Sofko, J., and Brennan, P. 2006. A prototype for authorship attribution studies. Literary and Linguistic Computing 21, 2, 169–178.
Karp, R. and Rabin, M. 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31, 2, 249–260.
Kaser, O., Keith, S., and Lemire, D. 2006. The LitOLAP project: Data warehousing with literature. In CaSTA'06.
Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., and Sellie, L. 1994. On the learnability of discrete distributions. In STOC'94. 273–282.
Keith, S., Kaser, O., and Lemire, D. 2005a. Analyzing large collections of electronic text using OLAP. In APICS 2005.
Keith, S., Kaser, O., and Lemire, D. 2005b. Analyzing large collections of electronic text using OLAP. Tech. Rep. TR-05-001, UNBSJ CSAS. June.
Keselj, V. and Cercone, N. 2004. CNG method with weighted voting. In Ad-hoc Authorship Attribution Contest, P. Joula, Ed. AHC/ALLC.
Kim, M.-S., Whang, K.-Y., Lee, J.-G., and Lee, M.-J. 2005. n-gram/2L: a space and time efficient two-level n-gram inverted index structure. In VLDB'05. VLDB Endowment, 325–336.
Kit, C. and Wilks, Y. 1998. The Virtual Corpus approach to deriving n-gram statistics from large scale corpora. In Proceedings of 1998 International Conference on Chinese Information Processing. 223–229.
Knuth, D. E. 1969. Seminumerical Algorithms. The Art of Computer Programming, vol. 2. Addison-Wesley.
Kolonko, M. and Wäsch, D. 2006. Sequential reservoir sampling with a nonuniform distribution. ACM Trans. Math. Softw. 32, 2, 257–273.
Kotidis, Y. 2002. Handbook of Massive Data Sets. Kluwer Academic Publishers, Norwell, MA, USA, Chapter Aggregate View Management in Data Warehouses, 711–741.
Lemire, D. and Kaser, O. 2006. One-pass, one-hash n-gram count estimation. Tech. Rep. TR-06-001, Dept. of CSAS, UNBSJ. Available from http://arxiv.org/abs/cs.DB/0610010.
Li, K.-H. 1994. Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n))). ACM Trans. Math. Softw. 20, 4, 481–493.
Lin, C.-Y. and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL'03. Association for Computational Linguistics, Morristown, NJ, USA, 71–78.
Losiewicz, P., Oard, D. W., and Kostoff, R. N. 2000. Textual data mining to support science and technology management. J. Intell. Inf. Syst. 15, 2, 99–119.
Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In SODA '90. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 319–327.
Manber, U. and Myers, G. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5, 935–948.
Marsaglia, G. and Zaman, A. 1987. Toward a universal random number generator. Tech. Rep. FSU-SCRI-87-50, Florida State University.
Matsumoto, M. and Nishimura, T. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8, 1, 3–30.
McAllester, D. A. and Schapire, R. E. 2000. On the convergence rate of Good-Turing estimators. In COLT'00: Proceedings of the Thirteenth Annual Conference on Computational Learning Theory. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1–6.
McCabe, M. C., Lee, J., Chowdhury, A., Grossman, D., and Frieder, O. 2000. On the design and evaluation of a multi-dimensional approach to information retrieval. In SIGIR '00. ACM Press, New York, NY, USA, 363–365.
Mothe, J., Chrisment, C., Dousset, B., and Alaux, J. 2003. DocCube: Multi-dimensional visualization and exploration of large document sets. Journal of the American Society for Information Science and Technology 54, 7, 650–659.
Nadeau, T. and Teorey, T. 2003. A Pareto model for OLAP view size estimation. Information Systems Frontiers 5, 2, 137–147.
Nagao, M. and Mori, S. 1994. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In COLING'94. 611–615.
Nie, J.-Y., Gao, J., Zhang, J., and Zhou, M. 2000. On the use of words and n-grams for Chinese information retrieval. In IRAL'00: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages. ACM Press, New York, NY, USA, 141–148.
Niemi, T., Hirvonen, L., and Järvelin, K. 2003. Multidimensional data model and query language for informetrics. Journal of the American Society for Information Science and Technology 54, 10, 939–951.
Orlitsky, A., Santhanam, N., and Zhang, J. 2003. Always Good Turing: asymptotically optimal probability estimation. In FOCS'03. 179–188.
Paulus, J. and Klapuri, A. 2003. Conventional and periodic n-grams in the transcription of drum sequences. In ICME'03. 737–740.
Project Gutenberg Literary Archive Foundation. 2007. Project Gutenberg. http://www.gutenberg.org/. Checked 2007-05-30.
Ruskey, F. 2006. The (combinatorial) object server. http://www.theory.cs.uvic.ca/~cos/cos.html. Checked 2007-05-30.
Schmidt, J., Siegel, A., and Srinivasan, A. 1993. Chernoff-Hoeffding bounds for applications with limited independence. In SODA'93. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 331–340.
Shah, B., Ramachandran, K., and Raghavan, V. 2004. Storage estimation of multidimensional aggregates in a data warehouse environment. In Proceedings of the World Multi-Conference on Systemics, Cybernetics and Informatics.
Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J.
Shukla, A., Deshpande, P., Naughton, J., and Ramasamy, K. 1996. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB'96. 522–531.
Su, Z., Yang, Q., Lu, Y., and Zhang, H. 2000. WhatNext: a prediction system for web requests using n-gram sequence models. In Web Information Systems Engineering 2000. 214–221.
Sullivan, D. 2001. Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales. John Wiley & Sons.
Sun Microsystems. 2004. String (Java 2 Platform SE 5.0). Online documentation: http://java.sun.com/j2se/1.5.0/docs/api/index.html.
Takabayashi, S. 2005. Sary: A suffix array library and tools. Online: http://sary.sourceforge.net/. Checked 2007-05-30.
The Unicode Consortium. 2006. Unicode home page. http://unicode.org/. Checked 2007-05-30.
Vitter, J. S. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1, 37–57.
Weiss, M. 1999. Data Structures and Algorithm Analysis in Java. Addison Wesley.
Whang, K., Vander-Zanden, B., and Taylor, H. 1990. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems 15, 2, 208–229.
Yamashita, T. 2005. SUFARY. Online: http://nais.to/~yto/tools/sufary. Checked 2007-05-30.
Yannakoudakis, E. J., Tsomokos, I., and Hutton, P. J. 1990. n-Grams and their implication to natural language understanding. Pattern Recogn. 23, 5, 509–528.
Yu, X., Zuzarte, C., and Sevcik, K. C. 2005. Towards estimating the number of distinct value combinations for a set of attributes. In CIKM'05. ACM Press, New York, NY, USA, 656–663.