Estimating Language Models Using Hadoop and Hbase
Xiaoyang Yu
Master of Science Artificial Intelligence School of Informatics University of Edinburgh 2008
Abstract

This thesis presents the work of building a large scale distributed ngram language model using the MapReduce platform Hadoop and the distributed database Hbase. We propose a method focusing on the time cost and storage size of the model, exploring different Hbase table structures and compression approaches. The method is applied to build training and testing processes using the Hadoop MapReduce framework and Hbase. The experiments evaluate and compare the different table structures on 100 million words of training data for unigram, bigram and trigram models, and the results suggest that a table based on the half ngram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large scale distributed language processing areas.
Acknowledgements

Many thanks to my supervisor Miles Osborne for his advice, great support and for inspiring new ideas about this project during our meetings. I would also like to thank my parents for their trust in me and their encouragement. Thanks a lot to Zhao Rui for her great suggestions for the thesis.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(Xiaoyang Yu)
Table of Contents

1 Introduction
2 Background
  2.1 Ngram Language Model
  2.2 Distributed Language Modeling
  2.3 Hadoop MapReduce Framework
  2.4 Hbase Database
3 Estimating Language Model using Map Reduce
  3.1 Generate Word Counts
  3.2 Generate Count of Counts
  3.3 Generate Good-Turing Smoothing Counts
  3.4 Generate Ngram Probability
  3.5 Generate Hbase Table
    3.5.1 n-Gram Based Structure
    3.5.2 Current Word Based Structure
    3.5.3 Context Based Structure
    3.5.4 Half Ngram Based Structure
    3.5.5 Integer Based Structure
  3.6 Direct Query
  3.7 Caching Query
4 Evaluation Methods
  4.1 Time and Space for Building LM
  4.2 LM Perplexity Comparison
5 Experiments
  5.1 Data
  5.2 Ngram Order
  5.3 Table Structures
6 Discussion
  6.1 Conclusion
  6.2 Future Works
Bibliography
A Source Code
  A.1 NgramCount.java
  A.2 TableGenerator.java
  A.3 NgramModel.java
Chapter 1

Introduction

In statistical natural language processing, the ngram language model is widely used in areas such as machine translation and speech recognition. An ngram model is trained on large sets of unlabelled text to estimate a probability model, and generally speaking, more data gives a better model. Although we can easily obtain huge amounts of training text from corpora or the web, the computational power and storage space of a single computer are limited. Distributed language models have been proposed to meet the need of processing vast amounts of data and to address these problems.

In the field of distributed language modeling, most of the relevant work concentrates on establishing methods for model training and testing, with no detailed discussion of the storage structure. The objective of our work is to efficiently construct and store the model in a distributed cluster environment. Moreover, our interest is to explore how well different database table structures cope with language models.

The novel aspects of our project are as follows. We show that a distributed database can be well integrated with distributed language modeling, providing the input/output data sink for both the distributed training and testing processes. Using a distributed database helps to reduce the computational and storage cost. Meanwhile, the choice of database table structure strongly affects the efficiency. We find that the half ngram based table structure is a good choice for distributed language models in terms of time and space cost, compared with the other structures in our experiments.

We use a distributed computing platform called Hadoop, which is an open source implementation of the MapReduce framework proposed by Google [3]. A distributed database named Hbase is used on top of Hadoop, providing a model similar to Google's Bigtable storage structure [2]. The training methods are based on Google's work on large language models in machine translation [1], and we propose a different method
using Good-Turing smoothing with a back-off model and pruning. We store the ngram model in the distributed database with 5 different table structures, and a back-off estimation is performed in a testing process using MapReduce. We propose different table structures based on the ngram, the current word, the context, the half ngram and converted integers.

Using Hadoop and Hbase, we build unigram, bigram and trigram models with 100 million words from the British National Corpus. All the table structures are evaluated and compared for each ngram order with a testing set of 35,000 words. The training and testing processes are split into several steps, and for each step the time and space cost are compared in the experiments. The perplexity of the testing set is also compared with the traditional language modeling package SRILM [6]. The results are discussed with respect to the choice of a proper table structure for efficiency.

The rest of this thesis is organised as follows: Chapter 2 introduces the ngram language model, related work on distributed language modeling, the Hadoop MapReduce framework and the Hbase distributed database; Chapter 3 gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks, as well as all the different table structures we propose; Chapter 4 describes the evaluation methods we use; Chapter 5 shows all the experiments and results; Chapter 6 discusses the choice of table structure for Hadoop/Hbase and possible future work.
Chapter 2

Background

In statistical language processing, it is essential to have a language model which measures how likely it is that a sequence of words occurs in some context. The ngram language model is the major language modeling method, used along with smoothing and back-off methods to deal with the problem of data sparseness. Estimating such a model in a distributed environment enables us to process vast amounts of data and obtain a better language model. Hadoop is a distributed computing framework that can be used for language modeling, and Hbase is a distributed database which can store the model data as database tables and integrate with the Hadoop platform.
2.1 Ngram Language Model
The ngram model is the leading method for statistical language processing. It tries to predict the next word w_n given the n-1 words of context w_1, w_2, \ldots, w_{n-1} by estimating the probability function:

P(w_n | w_1, \ldots, w_{n-1})    (2.1)

Usually a Markov assumption is applied that only the prior local context - the last n-1 words - affects the next word. The probability function can be expressed through the frequency of word occurrences in a corpus using Maximum Likelihood Estimation without smoothing:

p(w_n | w_1, \ldots, w_{n-1}) = \frac{f(w_1, \ldots, w_n)}{f(w_1, \ldots, w_{n-1})}    (2.2)

where f(w_1, \ldots, w_n) is the count of how many times we see the word sequence w_1, \ldots, w_n in the corpus. One important aspect is count smoothing, which adjusts the empirical counts collected from the training texts towards the expected counts for ngrams. Considering
ngrams that do not appear in the training set but are seen in the testing text, a maximum likelihood estimate would give them zero probability, so we need to find a better estimate of the expected counts. A popular smoothing method called Good-Turing smoothing is based on the count of ngram counts; the expected counts are adjusted with the formula:

r^* = (r + 1) \frac{N_{r+1}}{N_r}    (2.3)

where r is the actual count and N_r is the number of ngrams that occur exactly r times.
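As a small worked example of equation 2.3 (the counts here are made up and not taken from the experiments): if N_1 = 100 ngrams occur exactly once and N_2 = 40 ngrams occur exactly twice, then every ngram seen once gets the smoothed count

r^* = (1 + 1) \frac{N_2}{N_1} = 2 \times \frac{40}{100} = 0.8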
For words that are seen frequently, r can be large and N_{r+1} is likely to be zero; in this case we can look up N_{r+2}, N_{r+3}, \ldots, N_{r+n}, find a non-zero count and use that value instead. The idea of count smoothing is to better estimate the probabilities of ngrams; meanwhile, for unseen ngrams, we can do a back-off estimation using the probability of the lower order tokens, which is usually more robust in the model. For an ngram w_1, \ldots, w_n that does not appear in the training set, we estimate the probability of the (n-1)-gram w_2, \ldots, w_n instead:

P(w_n | w_1, \ldots, w_{n-1}) =
  p(w_n | w_1, \ldots, w_{n-1})                                      if (w_1, \ldots, w_n) is found
  \lambda(w_1, \ldots, w_{n-1}) \cdot p(w_n | w_2, \ldots, w_{n-1})   otherwise    (2.4)

where \lambda(w_1, \ldots, w_{n-1}) is the back-off weight. In general, back-off requires more lookups and computation. Modern back-off smoothing techniques like Kneser-Ney smoothing [5] use more parameters to estimate each ngram probability instead of a simple Maximum Likelihood Estimation.

To evaluate the language model, we can calculate the cross entropy and perplexity of the testing ngrams. A good language model should give a higher probability to each ngram prediction. The cross entropy is the average entropy of each word prediction in the ngram:

H(p_{LM}) = -\frac{1}{n} \log p_{LM}(w_1, w_2, \ldots, w_n)
          = -\frac{1}{n} \sum_{i=1}^{n} \log p_{LM}(w_i | w_1, \ldots, w_{i-1})    (2.5)

Perplexity is defined based on the cross entropy:

PP = 2^{H(p_{LM})} = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log p_{LM}(w_i | w_1, \ldots, w_{i-1})}    (2.6)
Generally speaking, perplexity is the average number of choices at each word prediction. So the lower the perplexity, the higher the probability of the predictions, meaning a better language model.
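To make equations 2.5 and 2.6 concrete, the following minimal Java sketch computes the cross entropy and perplexity for a handful of per-word probabilities; the probability values are invented for illustration and are not taken from the experiments.

public class PerplexityExample {
    public static void main(String[] args) {
        // assumed per-word probabilities p_LM(w_i | w_1, ..., w_{i-1})
        double[] probs = {0.2, 0.05, 0.1, 0.4};
        double h = 0.0;
        for (double p : probs)
            h += Math.log(p) / Math.log(2);   // sum of log2 probabilities
        h = -h / probs.length;                // cross entropy, equation 2.5
        double perplexity = Math.pow(2.0, h); // perplexity, equation 2.6
        System.out.println("cross entropy = " + h + ", perplexity = " + perplexity);
    }
}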
2.2 Distributed Language Modeling
Large scale distributed language modeling is quite a new topic. Typically using a server/client architecture as in Figure 2.1, we need to efficiently manipulate large data sets, communicate with the cluster workers and organise their work. The server controls and distributes tasks to all the workers, and clients send requests to the server to execute queries or commands. Related ideas can be seen in Distributed Information Retrieval [7]. Recent work includes a method that splits a large corpus into many non-overlapping chunks and makes each worker in the cluster load one of the chunks together with its suffix array index [8]. The suffix array index helps to quickly find the context we want, and word occurrences can be counted in each worker simultaneously. In this kind of approach, the raw word counts are stored and served in the distributed system, and the clients collect the needed counts and then compute the probability. The system is applied to the N-best list re-ranking problem and shows a nice improvement on a 2.97 billion-word corpus.
Figure 2.1: Server/Client Architecture
A similar architecture is proposed later with interpolated models [4]. The corpus
is also split into chunks along with their suffix array indexes. In addition, a smoothed n-gram model is computed and stored separately on some other workers. The client then requests both the raw word counts and the smoothed probabilities from the different workers, and computes the final probability by linear weighted blending. The authors apply this approach to N-best list re-ranking and integrate it with a machine translation decoder, showing good improvement in translation quality when trained on 4 billion words.

The previous two methods solve the problem of storing a large corpus and providing word counts for clients to estimate probabilities. Later work [1] describes a method that stores only the smoothed probabilities for distributed n-gram models. In the previous methods, the client needs to look up each worker to find the proper context using the suffix array index. In contrast, this method results in exactly one worker being contacted per n-gram, and exactly two workers for context-dependent backoff [1]. The authors propose a new backoff method called Stupid Backoff, which is a simpler scheme using only the frequencies and an empirically chosen back-off weight:

P(w_n | w_1, \ldots, w_{n-1}) =
  \frac{f(w_1, \ldots, w_n)}{f(w_1, \ldots, w_{n-1})}                if f(w_1, \ldots, w_n) > 0
  \alpha \cdot \frac{f(w_2, \ldots, w_n)}{f(w_2, \ldots, w_{n-1})}   otherwise    (2.7)

where \alpha is set to 0.4 based on their earlier experiments [1]. Their experiments directly store a 5-gram model built from different sources. According to their results, 1.8T tokens from the web contain a 16M-word vocabulary and generate 300G n-grams, giving a 1.8T language model with Stupid Backoff.

These previous works are the theoretical principles of our project. The last work used Google's distributed programming model MapReduce [3], but the authors did not describe the way the data is stored. We adopt the MapReduce model for our work, because it has a clear workflow and has already been proven a mature model, widely used in various applications by Google, Yahoo and other companies. Although Google's own implementation of MapReduce is proprietary, we can still choose an open source implementation of this model.
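As an illustration of equation 2.7, the sketch below scores an ngram with Stupid Backoff given in-memory raw frequencies. The counts, the corpus size and the fallback for completely unseen unigrams are assumptions made for the example; only alpha = 0.4 follows [1].

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StupidBackoffExample {
    static final double ALPHA = 0.4;                         // back-off weight from [1]
    static final Map<String, Long> counts = new HashMap<>();

    // Score S(w_n | w_1..w_{n-1}) following equation 2.7.
    static double score(List<String> ngram, long corpusSize) {
        if (ngram.size() == 1) {
            Long c = counts.get(ngram.get(0));
            // assumed fallback for an unseen unigram: 1 / corpus size
            return c == null ? 1.0 / corpusSize : (double) c / corpusSize;
        }
        Long full = counts.get(String.join(" ", ngram));
        Long context = counts.get(String.join(" ", ngram.subList(0, ngram.size() - 1)));
        if (full != null && context != null && context > 0)
            return (double) full / context;                  // relative frequency
        // back off to the shorter context w_2..w_{n-1}
        return ALPHA * score(ngram.subList(1, ngram.size()), corpusSize);
    }

    public static void main(String[] args) {
        // made-up counts for illustration
        counts.put("a", 50L); counts.put("big", 20L); counts.put("house", 10L);
        counts.put("a big", 8L); counts.put("big house", 3L);
        System.out.println(score(Arrays.asList("a", "big", "house"), 1000L));
    }
}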
2.3 Hadoop MapReduce Framework
Hadoop is an open source implementation of the MapReduce programming model. It is based on Java and uses the Hadoop Distributed File System (HDFS) to create multiple replicas of data blocks for reliability, distributing them around the cluster
and splitting the task into small blocks. According to its website, Hadoop has been demonstrated on 2,000 nodes and is designed to support clusters of up to 10,000 nodes, so it enables us to extend our cluster in the future. A general MapReduce architecture (see http://wiki.apache.org/hadoop/HadoopMapReduce) is illustrated in Figure 2.2. First the input files are split into small blocks named FileSplits, and Map tasks are created in parallel, one per FileSplit.
Figure 2.2: MapReduce Architecture
The input and output types of a MapReduce job are:

(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)

The FileSplit input is treated as key/value pairs, and the user specifies a Map function to process a key/value pair and generate a set of intermediate key/value pairs [3]. When the Map operation has finished, the output is passed to a Partitioner, usually a hash function, so that all pairs sharing the same key can be collected together later on. After the intermediate pairs are generated, a Combine function is called to do a reduce-like job within each Map node to speed up processing. Then a Reduce function merges all intermediate values associated with the same intermediate key and writes them to output files. Map and Reduce operations work independently on small blocks of data. The final output is one file per executed reduce task, and Hadoop stores the output files in HDFS.

For input text files, each line is parsed as one "value" string, so the Map function works at the sentence level; for output files, the format is one key/value pair per record, so if we want to reprocess the output files, the task works at the record level. For language model training using MapReduce, our original inputs
are text files, and what we finally get from Hadoop are ngram/probability pairs. This means that, in theory, we could build the language model using Hadoop alone. This chapter gives a brief overview of ngram language models, the smoothing and back-off methods, the relevant work on distributed language modeling, the Hadoop framework and the Hbase database.
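For reference, the following is a minimal sketch of the map/combine/reduce flow described above, written against the old org.apache.hadoop.mapred API that was current around the time of this work; class names and paths are illustrative, exact method signatures differ between Hadoop releases, and this is not the thesis code (which is listed in Appendix A).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountSkeleton {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable docid, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // emit <token, 1> for every whitespace-separated token in the line
            for (String w : line.toString().split("\\s+"))
                if (!w.isEmpty()) output.collect(new Text(w), ONE);
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;                          // merge all partial counts for this key
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCountSkeleton.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);      // the combiner does the same reduce-like merge
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}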
2.4 Hbase Database
Hbase is a distributed database on top of HDFS whose structure is very similar to Google's Bigtable model. The reason we look into using Hbase for the language model is that, although we could do all the work in Hadoop and store all outputs as text or binary files in HDFS, this would consume more time and extra computation for parsing input records. The bottleneck of Hadoop MapReduce is that the input/output is file based, either plain text or binary files, which is reasonable for storing large amounts of data but not suitable for query processing. For example, if we want to query the probability of one ngram, we have to load all the files into the map function, parse all of them, and compare each key with the ngram to find the probability value. Basically we would need to do this comparison for every ngram in the test texts, which would take quite a long time.

Instead of parsing files, we can make use of a database structure such as Hbase to store the ngram probabilities in database tables. The advantage is obvious: a database structure is designed to meet the needs of multiple queries; the data is indexed and compressed, reducing the storage size; and tables can be easily created, modified, updated or deleted. In a distributed database, the table is stored in the distributed filesystem, providing scalable and reliable storage. Meanwhile, for language modeling, the model data is highly structured. The basic format is the ngram/probability pair, which can easily be organised into more compact structures. Considering that we may get a huge amount of data, compressed structures are essential from both the time and storage aspects.

Hbase stores data in labelled tables. The table is designed to have a sparse structure. Data is stored in table rows, and each row has a unique key with an arbitrary number of columns, so one row may have thousands of columns while another row has only one column. In addition, a column name is a string of the <family>:<label> form, where <family> is a column family name assigned to a group of columns, and <label> can be any string you like. The idea behind column families is that only administrative operations can modify family names, but the user can create
arbitrary labels on demand. A conceptual view of an Hbase table can be illustrated below:

Row Key   Column "color:"                    Column "prize:"   Column "size:"
car       color:red = Mini Cooper                              medium
          color:green = Volkswagen Beetle                      small
          ... = BMW                          2.6k              ...
...       ...                                ...               ...
Another important aspect of Hbase is that the table is column oriented, which means that physically the tables are stored per column. Each column is stored in one file split, each column family is stored close together in HDFS, and empty cells in a column are not stored. This feature implies that in Hbase it is less expensive to retrieve a column than a row, because to retrieve a row the client must request all the column splits, whereas to retrieve a column only one column split is requested, which is basically a single file in HDFS. On the other hand, writing a table is row based. Only one row is locked for updating, so all writes among the clusters are atomic by default. The relationship between Hadoop, Hbase and HDFS is illustrated in Figure 2.3:
Figure 2.3: Relationship between Hadoop, Hbase and HDFS
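To make the row-based write and column-based read concrete, here is a hedged sketch using the HBase Java client. The Put/Get calls below come from a later HBase release than the version used for this thesis (whose client API differed), and the table name "ngram" and column "gt:prob" simply mirror the structures introduced in Chapter 3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "ngram");

        // Writing is row based: one Put updates a single row atomically.
        Put put = new Put(Bytes.toBytes("a big house"));
        put.add(Bytes.toBytes("gt"), Bytes.toBytes("prob"), Bytes.toBytes("1.0"));
        table.put(put);

        // Reading a single column touches only one column split.
        Get get = new Get(Bytes.toBytes("a big house"));
        get.addColumn(Bytes.toBytes("gt"), Bytes.toBytes("prob"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("gt"), Bytes.toBytes("prob"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}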
The overview above covers the basic techniques and tools used in this project. The next chapter describes our method of building the language modeling process for an ngram model with smoothing and back-off using Hadoop and Hbase.
Chapter 3

Estimating Language Model using Map Reduce

The distributed training process described in Google's work [1] is split into three steps: convert words to ids, generate ngrams per sentence, and compute ngram probabilities. We extend it to perform a Good-Turing smoothing estimation, so extra steps are included to calculate the count of ngram counts, store the counts in HDFS, and then fetch the data to adjust the raw counts. We decide to store the ngram strings directly, considering that more training data can be added and updated later on. We estimate a back-off model to compute the probability, and several database table structures are designed for comparison. The testing process also uses MapReduce, acting like a distributed decoder, so we can process multiple testing texts together. Figure 3.1 is the flow chart for the training process, and Figure 3.2 shows a simplified testing process. The rest of this chapter explains the details of each step in the training and testing process.
3.1 Generate Word Counts
The first step is to parse the training text, find all the ngrams, and emit their counts. The map function reads one text line as input: the key is the docid and the value is the text. Each line is split into all of its unigrams, bigrams, trigrams and so on up to the ngram order. These ngrams are the intermediate keys, and the values are a single count of 1. Then a combiner function sums up all the values for the same key within the Map task, and a reduce function identical to the combiner collects the combiner output and sums up the values for each key. The final key is the same as the map function's output, which
Figure 3.1: Training Process
Figure 3.2: Testing Process
is the ngram, and the value is the raw count of this ngram throughout the training data set. A partitioner based on the hashcode of the first two words is used, which makes sure not only that all values with the same key go to one reduce function, but also that the average load is balanced [1]. We also include a pruning count: any raw count below this pruning count is dropped. The simplified map and reduce functions can be illustrated as below:

map(LongWritable docid, Text line, OutputCollector output) {
    words[] = line.split(blank space or punctuation)
    for i = 1..order
        for k = 0..(words.length - i)
            output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator values, OutputCollector output) {
    int sum = 0;
    while (values.hasNext())
        sum += values.next().get();
    if sum > prune
        output.collect(key, sum);
}

Figure 3.3 describes the process of the map - combine - reduce functions given some sample text. The combine function starts right after map, so it inherits the same key/value pairs from its preceding map task. The output from the reduce function is the raw counts, and the keys are sorted, implying some indexing process. This feature can be used in enhancements for fast lookups.
Figure 3.3: Generate Raw Counts
Because this step generates all the ngrams, it is possible to collect the total number of unigram counts, bigram counts and so on. These numbers are necessary for smoothing techniques. Here we only collect the total unigram count for Good-Turing smoothing; it is easy to collect the total bigram or trigram counts in a similar way, which would be needed for Kneser-Ney smoothing.

enum MyCounter {INPUT_WORDS};

reduce(Text key, Iterator values,
       OutputCollector output, Reporter reporter) {
    .....
    if sum > prune
        output.collect(key, sum);
    if key is unigram
        reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}
3.2 Generate Count of Counts
Good-Turing smoothing is based on the count of ngram counts. We need to collect all the counts of counts for unigrams, bigrams, trigrams up to ngrams. To do this, all the raw counts are loaded into a new MapReduce job. For each ngram, the map function emits one count of the raw count along with the ngram order, so the output key has the format <ngram order, ngram count> and the value is 1. The combine and reduce functions merge all counts of counts with the same key. The final output should be fairly small; normally a single file is enough to store all the counts of counts. The simplified map function is like this:

// input key is ngram, input value is the raw counts
public void map(Text key, IntWritable value, OutputCollector output) {
    words[] = toStringArray(key);
    // words.length is the ngram order
    // combine the order with the counts
    String combine = words.length + value.toString();
    output.collect(toText(combine), one);
}

The combine and reduce functions are the same as in the previous step. An example of the MapReduce job is illustrated in Figure 3.4. One important point about the count of counts is that for higher counts, the count of counts is usually discontinuous. This always happens for frequent ngrams, which have quite high counts. We do not have to store or estimate these empty counts of counts; instead we can adjust the smoothing formula used in the next step.
Figure 3.4: Generate Count of Counts
3.3 Generate Good-Turing Smoothing Counts
Now that we have both the raw counts and the counts of counts, following the Good-Turing smoothing formula we can estimate the smoothed count for each ngram. In this step, the inputs are still the ngrams and their raw counts, and each map function reads in the count-of-counts file, stores all the data in a HashMap structure and computes the smoothed counts. The basic formula is:

r^* = (r + 1) \frac{N_{r+1}}{N_r}    (3.1)

If we can find both N_{r+1} and N_r in the HashMap, then the above formula can be applied directly. Otherwise, we try to look up the "nearest" count of counts, e.g. if we cannot find N_{r+1}, we try to look up N_{r+2}, then N_{r+3}, N_{r+4}, etc. We decide to look up at most 5 counts, N_{r+1} .. N_{r+5}. If we cannot find any of these counts, we use the raw count instead. In this situation the raw count is typically very large, meaning the ngram has a relatively high probability, so we do not have to adjust the count. For each ngram, the smoothing process is needed only once, so we actually do not need any combine or reduce function for count smoothing. The map function can be illustrated as:
HashMap countHash;

void configure() {
    reader = openFile(count-of-counts);
    // k is the 'ngram-order ngram-counts', v is the count of counts
    while (reader.next(k, v)) {
        order = k[0];
        counts = k[1..length-1];
        countHash.put(order, (counts, v));
    }
    reader.close();
}

// input key is the ngram, value is the raw counts
void map(Text key, IntWritable value, OutputCollector output) {
    for i = 1..5
        if exists countHash.get(key.length).get(value)
           && exists countHash.get(key.length).get(value+i) {
            Nr = countHash.get(key.length).get(value);
            Nr1 = countHash.get(key.length).get(value+i);
            c = (value+1)*Nr1/Nr;
            break;
        } else
            c = value;
    // c is the smoothed counts for ngram 'key'
}

Figure 3.5 shows an example of the Good-Turing smoothing.
3.4 Generate Ngram Probability
To estimate the probability of one ngram w1 , w2 , ..., wn , we need the counts of w1 ..wn and w1 ..wn−1 . Because one MapReduce function, either map or reduce, is working based on one key, in order to collect both above counts, we need to emit the two counts to one reduce function. The approach is to use the current context w1 , w2 , ..., wn−1 as the key, and combine the current word wn with the counts as the
Figure 3.5: Good-Turing Smoothing
value. This can be done in the map function of the previous step. Notice that, except for the highest order ngrams, all lower order ngrams also need to be emitted with their own smoothed counts, since they serve as contexts themselves. Figure 3.6 gives an example of the emitted context with the current word format.

...
// c is the smoothed counts for ngram 'key'
// suffix is the current word, the last item in the string array
suffix = key[length-1];
context = key[0..length-2];
output.collect(context, ('\t'+suffix+'\t'+c));
if order < highest order
    output.collect(key, c);
...

Now the reduce function can receive both the count of the context and the count with the current word, and computes the conditional probability for the ngram based on the formula:

p(w_n | w_1, \ldots, w_{n-1}) = \frac{f(w_1, \ldots, w_n)}{f(w_1, \ldots, w_{n-1})}    (3.2)
Figure 3.6: Emit context with current word
// the input k is the context
void reduce(Text k, Iterator v) {
    while (v.hasNext) {
        value = v.get;
        // if it has a current word
        if (value starts with '\t') {
            get current word w
            get counts C
        }
        // if no current word
        else
            get base counts Cbase
        // compute the probability
        prob = C/Cbase;
        if prob > 1.0
            prob = 1.0;
    }
    ...
}

After Good-Turing smoothing, some counts might be quite small, so the probability might exceed 1.0; in this case we need to adjust it down to 1.0. For the back-off model, we use the simple scheme proposed by Google [1], in which the back-off weight is set to 0.4. The number 0.4 is chosen based on empirical experiments and has proved a stable choice in previous work. If we wanted to estimate a dynamic back-off weight
Figure 3.7: Estimate probability
for each ngram, more steps would be required, but it has been argued that the choice of a specific back-off or smoothing method becomes less relevant as the training corpus grows large [4]. In this step, we obtain all the probabilities for the ngrams, and with the back-off weight we can estimate the probability of a testing ngram through queries. So the next important step is to store these probabilities in the distributed environment.
3.5 Generate Hbase Table
Hbase can be used as the data input/output sink in Hadoop MapReduce jobs, so it is straightforward to integrate it with the previous reduce function. Some essential modifications are needed: because writing to an Hbase table is row based, each time we need to generate a row key with some context as the column. There are several different choices, either a simple scheme of one ngram per row, or more structured rows based on the current word and the context. Two major aspects are considered: the write/query speed and the table storage size.
3.5.1 n-Gram Based Structure
An initial structure is very simple, similar to the text output format. Each n-gram is stored in a separate row, so the table has a flat structure with one single column. For each row, the key is the ngram itself, and the column stores its probability. Table 3.1 is an example of this structure. This structure is easy to implement and maintain, yet it is
only sorted, not compressed. Considering that there may be many identical probabilities among different ngrams, e.g. higher order ngrams mostly appear only once, the table cannot represent these repeated probabilities efficiently. Also, since we only have one column, the advantages of a distributed database - arbitrarily sparse columns - have little effect for this table.

key           column family:label (gt:prob)
a             0.11
a big         0.67
a big house   1.0
buy           0.11
buy a         0.67
...           ...

Table 3.1: n-Gram based table structure
3.5.2 Current Word Based Structure
Alternatively, for all the ngrams w_1, w_2, \ldots, w_n that share the same current word w_n, we can store them in one row with the key w_n. All the possible contexts are stored in separate columns with the column name format <gt:context>. Table 3.2 is an example. This table has a sparse column structure: for a word with many contexts the row can be quite long, while for a word with few contexts the row is short. In this table we reduce the number of rows and expand all contexts into separate columns, so instead of one single column split we have many small column splits. From the view of the distributed database the data is sparsely stored, but from the view of the data structure it is still somewhat uncompressed: if two current words in two rows have the same context, or the same probability for some context, we still store them separately. This results in multiple column splits with very similar structures, which means a certain redundancy. We also need to collect the unigram probability for each current word and store it in a separate column.
key     gt:unigram   gt:a    gt:a big   gt:i   gt:buy   ...
a       0.15                                   0.667    ...
big     0.057        0.667                              ...
house   0.3          0.667   1.0                        ...
buy     0.15                            1.0             ...
...     ...          ...     ...        ...    ...      ...

Table 3.2: Word based table structure
The process to create table rows can be illustrated as:

// k is the ngram, c is the probability
for each ngram (k, c) {
    word = k[length-1];
    if k.length == 1
        column = "gt:unigram";
    else {
        context = k[0..length-2];
        column = "gt:context";
    }
    output.collect(word, (column, c));
}
3.5.3 Context Based Structure
Similar to the current word based structure, we can also use the context w_1, w_2, \ldots, w_{n-1} as the key for each row, and store all the possible following words w_n in separate columns with the format <gt:word>. This table has more rows than the word based table, but still fewer than the ngram based table. For a large data set or higher order ngrams the number of contexts can be quite large, while on the other hand the number of columns is reduced. The column splits are separated by the different words, and for all words that only occur once the split is still very small - only one column value per split. Generally speaking, if we have n unigrams in the data set, we will have around n columns in the table. For a training set containing 100 million words, the number of unigrams is around 30,000, so the table can be really sparse. An example of this table structure is illustrated in Table 3.3.
column family:label gt:unigram
a
22
gt:a
0.11
gt:big gt:i gt:buy gt:the gt:house 0.67
0.67
a big buy
..
1.0 0.11
0.67
0.67
buy a
1.0
i
0.04
...
...
1.0 ...
...
...
...
...
...
...
Table 3.3: Context based table structure
To avoid redundancy, only unigram keys are stored with their own probabilities in the <gt:unigram> column, since storing the probability of "a big" under its own row would be redundant with row "a", column <gt:big>.

// k is the ngram, c is the probability
for each ngram (k, c) {
    context = k[length-1];
    if k.length == 1
        column = "gt:unigram";
    else {
        word = k[length-1];
        context = k[0..length-2];
        column = "gt:word";
    }
    output.collect(context, (column, c));
}

Since there may be many columns that only appear once and have the same value, typically in higher order ngrams, a possible compression is to combine these columns together, reducing the number of column splits.
3.5.4 Half Ngram Based Structure
For the previous two structures, we get either a large number of rows or a large number of columns, so there is a possible trade-off between rows and columns. We can combine the word based and context based structures, balancing the
number of rows and columns. Our method is to split the ngram into halves, using the first half as the row key and the rest as the column label. For example, for a 4-gram (w_1, w_2, w_3, w_4), the row key is (w_1 w_2) and the column is <gt:w_3 w_4>. An example for a 4-gram model is illustrated in Table 3.4:

key     gt:unigram   gt:a   gt:big   gt:house   gt:big house   gt:new house   ...
a       0.11                0.67     0.67
a big                                1.0                       0.01
buy     0.11         0.67
buy a                                           1.0
i       0.04                                                   0.04
...     ...

Table 3.4: Half ngram based table structure
For higher order ngrams, this structure removes a large number of rows and inserts them as columns. Theoretically the costs of splitting an ngram into word-context and context-word are identical, but the half/half split requires a bit more parsing. Similar to the previous method, the process can be written as:

// k is the ngram, c is the probability
for each ngram (k, c) {
    half = int(length/2);
    context = k[0..half];
    if k.length == 1
        column = "gt:unigram";
    else {
        word = k[half..length-1];
        column = "gt:word";
    }
    output.collect(context, (column, c));
}
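The following small Java sketch (not the thesis code, whose full listing is in Appendix A) shows the row/column split the pseudocode above describes: the first half of the ngram becomes the row key and the remaining words become the column label under the gt family, with unigrams going to gt:unigram.

import java.util.Arrays;

public class HalfNgramKey {
    // Returns {row key, column name} for one ngram under the half ngram scheme.
    static String[] rowAndColumn(String[] ngram) {
        if (ngram.length == 1)
            return new String[] { ngram[0], "gt:unigram" };
        int half = ngram.length / 2;
        String row = String.join(" ", Arrays.copyOfRange(ngram, 0, half));
        String label = String.join(" ", Arrays.copyOfRange(ngram, half, ngram.length));
        return new String[] { row, "gt:" + label };
    }

    public static void main(String[] args) {
        // 4-gram example matching Table 3.4: row "buy a", column "gt:big house"
        String[] rc = rowAndColumn(new String[] { "buy", "a", "big", "house" });
        System.out.println(rc[0] + " -> " + rc[1]);
    }
}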
3.5.5 Integer Based Structure
Integer Based Structure
Instead of store all the strings, we can also convert all the words into integers, and store the integers in table. Extra steps are needed to convert each unigram into an unique integer number and keep the convert unigram-integer map in the distributed filesystem. The advantage of using integers is the size will be smaller than strings, because of storing a long length string with a single integer number. But on the other hand we need another encoding/decoding process to do the converting, and this results in more computational time. This method is a trade off between computational time and the storage size. Also this structure can be combined with previous methods for better compressions. For the simple ngram per row scheme, integer based structure can be illustrated as Table 3.5: unigram integer
key
a
1
big
2
house
3
buy
4
column family:label (gt:prob)
1
0.11
12
0.67
123
1.0
4
0.11
41
0.67
...
...
Table 3.5: Integer based table structure
Notice that if we store "a big" as 12, it may conflict with another word mapped to the number 12 in the conversion map, so we have to add a space between the numbers, similar to the ngram strings. We need an extra MapReduce job to reprocess the raw counts, convert each unigram into an integer, and store the mapping in a table:
// extra step to convert each unigram into a unique integer
// input key is the ngram, value is the raw counts
int id;
column = "convert:integer";
map(Text k, IntWritable v, OutputCollector output) {
    if (k.length == 1) {
        id++;
        output.collect(k, (column, id));
    }
}
...

Then we can query the conversion table and store integers instead of strings:

column = "convert:integer";
// k is the ngram, c is the probability
for each ngram (k, c) {
    for each k[i] in k
        // query the convert table or file,
        // find the integer for each k[i]
        intK[i] = get(k[i], column);
    column = "gt:prob";
    row = combineInteger(intK);
    output.collect(row, (column, c));
}
3.6 Direct Query
The next process is the testing function; the actual back-off is performed here in the query. Based on the back-off model, for each testing ngram we query the ngram and, if it is not found, we query the n-1 gram, and so on until we reach the unigram. For the different table structures, we just need to generate different row and column names. The advantage of using MapReduce for the testing is that we can put multiple testing texts into HDFS, and a MapReduce job can process all of them to generate the raw counts, just as we did in the training process. Then for each ngram with its counts,
we directly estimate the probability using the back-off model and multiply by the counts. In this way, each distinct ngram is processed only once, which speeds up the whole process, especially for lower order ngrams. We call this method Direct Query because we query each ngram directly from the Hbase table, so the more testing ngrams we have, the more time it costs. The perplexity of the estimation is also computed and collected as an evaluation value for the language model. An example using the simple ngram based table structure is:

// the job for raw counts
// input key is the docid, value is one line in the text
map(LongWritable docid, Text line, OutputCollector output) {
    words[] = line.split(blank space or punctuation)
    for i = 1..order
        for k = 0..(words.length - i)
            output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator values, OutputCollector output) {
    ...
}

// the job for estimating probability
// the input key is the ngram, value is the raw counts
column = "gt:prob";
map(Text k, IntWritable v, OutputCollector output) {
    // a back-off calculation
    row = k.toString;
    while (finished == false) {
        value = table.get(row, column);
        if (value != null) {
            // found the probability
            prob = alpha*value;
            finished = true;
        } else {
            // unseen unigram
            if k.length == 1 {
                prob = unseen_prob;
                finished = true;
            } else {
                // back off to the n-1 gram
                ngram = row;
                row = ngram[0..length-2];
                alpha = alpha*0.4;
                finished = false;
            }
        }
    }
    // now we have the probability in prob
    count += v;
    if prob > 1.0
        prob = 1.0
    H += v*log(prob)/log(2);
    output.collect(k, prob);
}

// compute the total perplexity
void close() {
    perplexity = pow(2.0, -H / count);
    ...
}

Figure 3.8 is an example of this direct query process. If the estimated probability is above 1.0, we need to adjust it down to 1.0. The map function collects the total counts and computes the total perplexity in the overridden close method. As a reference, the probability of each ngram is collected in HDFS as the output of the final reduce job.
Figure 3.8: Direct Query
3.7 Caching Query
There is a little more we can do to speed up the query process. Consider queries for different ngrams that share the same n-1 gram: in a back-off model we query the ngram first, then, if it is not found, we move on to the n-1 gram. If we need to back off for each ngram, the n-1 gram will be requested multiple times. This is where caching steps in. For each back-off step, we store the n-1 gram probability in a HashMap in memory inside the working node. Every time the node comes to a new n-1 gram back-off query, it first looks it up in the HashMap and, if it is not found, asks the Hbase table and adds the new n-1 gram to the HashMap. The simplified scheme can be written as:

...
// k is the ngram
HashMap cache;
while not finished {
    prob = table.get(k, column);
    if prob != null
        found probability, finished
    else {
        let row be the n-1 gram from k
        if cache.exists(row)
            prob = cache.get(row)
        else {
            prob = table.get(row, column);
            if prob != null {
                found probability, finished
                if number of cache.keys < maxlimit
                    cache.add(row, prob);
                else {
                    cache.clear();
                    cache.add(row, prob);
                }
            }
        }
    }
}
...

Figure 3.9 is an example of this caching query process. We do not need to store probabilities for the full ngrams, only the n-1 grams. There is also a maximum limit on the number of keys in the HashMap: we cannot store all the n-1 grams in the HashMap, otherwise it would become huge and eat up all the working node's memory. So we only store up to maxlimit n-1 grams, and when the count goes over the limit, the previous HashMap is dropped and refilled with new items. It works like an updating process; another alternative is to delete the first key in the HashMap and push in the new one (a sketch of such a bounded cache is given after Figure 3.9).

The above methods establish the whole process of distributed language model training and testing. The next chapter describes the evaluation methods we use.
Figure 3.9: Caching Query
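As mentioned at the end of Section 3.7, instead of clearing the whole HashMap when maxlimit is reached, the cache can evict only its oldest entry. The following is a minimal sketch of that alternative (not the thesis implementation) using java.util.LinkedHashMap; the maxlimit value and the use of Double probabilities are assumptions for illustration.

import java.util.LinkedHashMap;
import java.util.Map;

public class BackoffCache {
    private final int maxlimit;
    private final LinkedHashMap<String, Double> cache;

    public BackoffCache(final int maxlimit) {
        this.maxlimit = maxlimit;
        // insertion-ordered map that drops its eldest entry once the limit is exceeded
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > BackoffCache.this.maxlimit;
            }
        };
    }

    public Double get(String nMinusOneGram) {
        return cache.get(nMinusOneGram);   // null means: ask the Hbase table instead
    }

    public void put(String nMinusOneGram, double prob) {
        cache.put(nMinusOneGram, prob);    // may evict the oldest n-1 gram
    }
}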
Chapter 4

Evaluation Methods

Our major interest is to explore the computational and storage cost of building a distributed language model using the Hadoop MapReduce framework and the Hbase distributed database. The evaluation focuses on comparing the time and space used by the different table structures. As a language model, the perplexity on a testing set is also evaluated and compared with a traditional language modeling tool.
4.1 Time and Space for Building LM
There are two major processes we can evaluate, the training process and the testing process. Because we are experimenting with different table structures, we split the training process into two steps: the first is generating the raw counts and collecting the Good-Turing smoothing parameters, and the second is generating the table. The first step is the same for all tables, so we can focus on the second step for comparison. The comparison of time cost is based on the average program running time over multiple runs, to reduce deviations. Network latency and other disturbances may vary the result, but the error should remain within about 1-2 minutes. The program calculates its running time by comparing the current system time before and after the MapReduce job is submitted and executed. Each table structure is compared at the same ngram order, and different ngram orders are compared for each table to see the relationship between order and time cost for each method.

To collect the size of the language model, we can use the command-line programs provided with the Hadoop framework. Because the Hbase table is actually stored in the HDFS filesystem, we can directly calculate the size of the directory created by Hbase.
Typically, Hbase creates a directory called "/hbase" in HDFS and a sub directory with the name of the table; e.g. if we create a table named "trigram", we can calculate the size of the directory "/hbase/trigram" as the size of the table. It is not a perfectly accurate estimation of the table data because the directory also contains other meta info files, but since these files are usually quite small, we can count them together. Another point of view is that the table has two parts, the data and the info, and we calculate these two parts together.
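A minimal sketch of these two measurements is shown below: wall-clock timing around a job submission and the total size of the Hbase table directory in HDFS. The job call is left as a placeholder and the "/hbase/trigram" path is just the example table name used above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MeasureJob {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // runJob(...);   // submit the MapReduce job and wait for completion here
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("running time: " + (elapsed / 1000) + " seconds");

        // total length of all files under the table's directory in HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long bytes = fs.getContentSummary(new Path("/hbase/trigram")).getLength();
        System.out.println("table size: " + (bytes / (1024 * 1024)) + " MB");
    }
}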
4.2 LM Perplexity Comparison
For a language model, the perplexity on a testing set is a common evaluation of how good the model is. The perplexity is the average number of choices at each word prediction: generally speaking, the better the model, the lower the perplexity. The ngram order also affects the perplexity of the same model; for a normal sized training set, a higher order ngram model will always get a lower perplexity. Likewise, the more training data we have, the better the model and the lower the perplexity becomes.

We can also compare the distributed language model with traditional language modeling tools like SRILM [6]. SRILM is a feature-rich package for building and evaluating language models. The package is written in C++, and because it runs locally on a single computer, the processing speed is fast. The shortcoming of SRILM is that it eats up memory and may even overflow when processing huge amounts of data. Still, we can compare SRILM with our distributed language model on the same, not-so-huge training set. The smoothing methods need to be nearly the same, e.g. Good-Turing smoothing. The specific parameters may vary, but a similar smoothing method can show whether the distributed language model is stable compared with a traditional language modeling package. Applying these evaluation methods, the experiments and results are presented in the next chapter.
Chapter 5

Experiments

The experiments are run in a cluster environment with 2 working nodes and 1 master server. The working nodes run the Hadoop, HDFS and Hbase slave processes, and the master server controls all of them. Each experiment is repeated three times and the average value is taken as the result. The results are shown as figures and tables for the different ngram orders of all the table structures, the time and space cost, and the perplexity on the testing set. We first compare the size and number of unique ngrams for training and testing data taken from a 100 million-word corpus. The time and space cost for generating the raw counts, generating the table structures, and finally the testing queries are compared separately. We also compare the unigram, bigram and trigram models at each step. All 5 proposed table structures are compared on time cost for both the training and testing process, as well as on table size. Finally, the perplexity for each ngram order is compared with SRILM, and the model data size from SRILM is calculated as a reference.
5.1 Data
The data is taken from the British National Corpus, which contains around 100 million words. We choose some random texts from the corpus as the testing set, which is about 35,000 words; all the remaining texts are used as the training set. The corpus is formatted as one sentence per line. For completeness we keep all the punctuation in a sentence as part of the ngrams, so items such as <'s> and <door> appear as separate tokens. The approximate number of tokens and the file sizes of the training and testing data are:
            training data   testing data
tokens      110 million     40,000
data size   541 MB          202 KB

Table 5.1: Data
Notice that the tokens contain both the words and the punctuation, slightly increasing the total number of ngrams.
5.2 Ngram Order
We choose to train up to a trigram model in the training process. A count pruning threshold of 1 is applied for all the models. Table 5.2 shows the number of unique ngrams of each order in the training and testing data sets. Figure 5.1 plots the unique ngram numbers for the training and testing data. We can see from the figure that the training data has a sharper curve; considering the different numbers of tokens, this shows that a larger data set produces more trigrams, meaning more variety.

            training data   testing data
tokens      110 million     40,000
unigram     284,921         7,303
bigram      4,321,467       25,862
trigram     9,090,713       33,120

Table 5.2: Ngram numbers for different order
For all the different table structures, steps 1 and 2 - raw counting and Good-Turing parameter estimation - are identical. So we can evaluate this part first, independently of the choice of table type. The time and space costs are compared in Table 5.3 and Table 5.4. For the testing process only the raw counting step is required. All the raw counting outputs are stored in HDFS as compressed binary files. The trend is shown in Figure 5.2. There is a big difference between the two lines: as we can see from Table 5.3, when the number of tokens is relatively small the processing times are nearly the same, but when the number of tokens becomes large, the difference between the unigram, bigram and trigram models increases greatly.
Figure 5.1: Unique ngram number
Figure 5.2: Time cost in raw counting
            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     15 min 38 sec                         19 sec
bigram      41 min 6 sec                          20 sec
trigram     75 min 41 sec                         22 sec

Table 5.3: Time cost in raw counting
Another aspect affecting the time cost is that during the training process a second MapReduce job is needed to estimate the count of counts for Good-Turing smoothing. For this job, only one reduce task is launched in the working nodes, producing a single output file. As the ngram order increases, the input records become larger, requiring more processing time.

            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     1.5 MB                                35 KB
bigram      22 MB                                 141 KB
trigram     49 MB                                 234 KB

Table 5.4: Space cost in raw counting
Table 5.4 shows the size of the raw counts file for each ngram order. The size increases rapidly, implying that the size of the table will also increase sharply. Notice that in Table 5.4 the size is given per ngram order, but our table needs to store all the orders from unigram and bigram up to the ngram order, so the table size is the total sum, which further increases the number. Figure 5.3 suggests that the training and testing data have a similar upward trend, meaning the space cost depends less on the number of tokens and increases monotonically with the ngram order.
Figure 5.3: Space cost in raw counting
5.3 Table Structures
For each of the 5 table structures, we generate the tables for the unigram, bigram and trigram models. Tables 5.5 and 5.6 show the time and space cost of the training process for each order in generating the table (step 3); Tables 5.7 and 5.8 show the time cost of the testing process in the querying step with the two different methods.

                           unigram         bigram          trigram
type 1: ngram based        5 min 40 sec    62 min 21 sec   300 min 26 sec
type 2: word based         5 min 41 sec    64 min 42 sec   200 min 30 sec
type 3: context based      6 min 40 sec    64 min 39 sec   185 min 20 sec
type 4: half ngram based   6 min 7 sec     60 min 37 sec   190 min 40 sec
type 5: integer based      16 min 26 sec   >240 min        >10 hours

Table 5.5: Time cost in training step 3, generating table
The size of each table structure is calculated from the size of the table's directory in HDFS, including the data and the table meta info.
                           unigram   bigram   trigram
type 1: ngram based        2.4 MB    44 MB    173 MB
type 2: word based         2.4 MB    60 MB    185 MB
type 3: context based      2.4 MB    44 MB    135 MB
type 4: half ngram based   2.4 MB    44 MB    129 MB
type 5: integer based      4.1 MB    ~40 MB   ~120 MB

Table 5.6: Space cost of different table structures
For type 5, an extra step of converting the ngrams into integers costs about 5 minutes, and this is added to the time cost for type 5. The size of the conversion table is around 2.4 MB, and this figure is also added to the space cost for type 5. Taking into account the errors caused by network latency, transmission failures and I/O delays, the time cost may vary within an error range of ±1-2 minutes.

Figures 5.4 and 5.5 compare the time and space cost for these five table types. All the tables are created with the same compression option in Hbase. As we can see from Figures 5.4 and 5.5, type 5 takes much longer than the other table types, but its space cost is reduced. For the unigram models, the first 4 tables behave similarly, which is easy to understand because for unigrams all these structures are identical. For the bigram models, the training time stays almost the same except for type 5; the space cost keeps decreasing through types 3-5, while type 2 slightly increases the space cost. The trend is more obvious for the trigram models: types 2-4 all reduce the training time, type 2 slightly increases the space cost, and types 3-5 greatly reduce the space cost.
Figure 5.4: Time cost in training step 3, generating table
Figure 5.5: Space cost of different table structures
Another comparison can be made to see how the time and space cost change as the ngram order grows. Figures 5.6 and 5.7 illustrate the trend lines of the time and space cost for the different ngram orders. Because type 5 needs an extra MapReduce job and uses two tables, its time cost is much larger than the other types. Type 1 costs a bit more time for the trigram model compared with types 2-4. The lowest number is from type 3, about 5 minutes less than type 4, that is 3.1% smaller. The space cost shows a different trend: for the trigram model, types 1 and 2 climb to higher numbers while the others stay smaller. The lowest number is from type 4, about 4.4% smaller than type 3.
Figure 5.6: Time cost for each ngram order
The next step is the probability query for the testing data. The time costs for both direct query and caching query are given in Tables 5.7 and 5.8. Type 2, the word based structure, costs a bit more compared with types 1, 3 and 4. As in the training process,
Figure 5.7: Space cost for each ngram order
type 5 is still heavily time consuming. Type 1 shows a good result this time, unlike its performance in the training process. Again, we have to tolerate an error range of ±1-2 minutes because of possible interruptions such as network problems.
                           unigram         bigram          trigram
type 1: ngram based        5 min 5 sec     30 min 20 sec   40 min 2 sec
type 2: word based         11 min 14 sec   48 min 28 sec   88 min 17 sec
type 3: context based      6 min 31 sec    26 min 32 sec   40 min 15 sec
type 4: half ngram based   6 min 33 sec    26 min 4 sec    42 min 4 sec
type 5: integer based      10 min 39 sec   >50 min         >1.5 hours

Table 5.7: Time cost in testing, direct query
Figure 5.8 shows that the differences between types 1, 3 and 4 are relatively small, considering the deviations that might affect the values, and Figure 5.9 shows a similar trend. Comparing direct query and caching query, we can see that the caching query does not improve the time cost as much as we expected; for the bigram and trigram models the time cost is even higher than with direct query. Still, the comparison between the table types is similar to Figure 5.8: types 3 and 4 perform better, type 1 is good, while types 2 and 5 fall behind the others. The perplexity of the testing process compared with SRILM is shown in Table 5.9. The SRILM package uses Good-Turing smoothing by default, and the perplexity is computed using the default parameters for the unigram, bigram and trigram models. We can see that the perplexity from the SRILM tools is higher than that of our model, especially for the
                           unigram         bigram          trigram
type 1: ngram based        5 min 10 sec    24 min 17 sec   33 min 41 sec
type 2: word based         16 min 58 sec   49 min 10 sec   87 min 8 sec
type 3: context based      6 min 36 sec    31 min 13 sec   52 min 56 sec
type 4: half ngram based   6 min 31 sec    29 min 45 sec   51 min 13 sec
type 5: integer based      11 min 2 sec    >50 min         >1.5 hours

Table 5.8: Time cost in testing, caching query
Figure 5.8: Time cost in direct query
Figure 5.9: Time cost in caching query
>1.5 hour
Another figure from SRILM is the size of the language model data, which is 683 MB by default without compression. Since we use compression in Hbase, we also compress the SRILM file with gzip; the compressed size is only 161 MB, a reduction of about 76.5%. The numbers cannot be compared directly, because the SRILM plain text file also stores other information that inflates its size, but the compressed file serves as a reference under similar conditions. We can see that the table sizes of types 3, 4 and 5 are still smaller.

               unigram     bigram      trigram
Hadoop+Hbase   2544.732    599.171     517.3108
SRILM          3564.03     742.269     573.563
Table 5.9: Perplexity for testing data
From our experiments we can see clear efficiency differences between the table structures and ngram orders. The next chapter discusses the proper choice of table structure and future work based on these results.
Chapter 6
Discussion
6.1 Conclusion
From our experiments, we think a table with the half ngram based structure is a good choice for a distributed language model using Hadoop and Hbase. Considering both time and space cost, the ngram based structure is too simple to take advantage of the distributed features, although this flat structure is fast and easy to manipulate locally and is widely used in traditional language modeling tools like SRILM.

The word based structure and the context based structure are essentially a transposed table pair. If we have r words and c contexts, the word based structure is a table of r rows with at most c columns, and the context based structure is a table of c rows with at most r columns. The question is which one fits Hbase, which is sparsely distributed and column oriented, better. In theory the parsing cost of the two types is the same, because we emit the context with all its following words as intermediate keys and no extra parsing is needed, so we only need to compare whether more rows or more columns is cheaper. From the experiments we find that the context based structure is better, meaning that more rows are less expensive than more columns. Fewer columns mean fewer column splits; since Hbase stores each column split separately, this reduces the number of I/O requests, which may explain why the word based structure performs poorly in the testing process. Another aspect is the column label names. For the word based structure the labels are all the possible contexts from unigram and bigram up to (n-1)-gram, whereas the context based structure only has the unigrams as column labels. Because Hbase is column oriented, the table needs to maintain an index of all the column labels, and it is better to keep the label names simple, especially when we move to higher ngram orders such as 5-grams.
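To make the difference concrete, the following sketch (an illustration only, not part of the thesis code; the class and method names are invented here) shows where the probability of one trigram would be stored under three of the structures discussed in this chapter. The "gt" column family matches the one used by the reducers in Appendix A.

// Illustration only: where the probability of the trigram "new york city"
// would be keyed under three table structures. Returned pair = { row key, column }.
public class KeySketch {

    // word based: row = current word, column label = the n-1 gram context
    static String[] wordBasedKey(String[] ngram) {
        StringBuilder context = new StringBuilder(ngram[0]);
        for (int i = 1; i < ngram.length - 1; i++)
            context.append(" ").append(ngram[i]);
        return new String[] { ngram[ngram.length - 1], "gt:" + context };
    }

    // context based: row = the n-1 gram context, column label = current word
    static String[] contextBasedKey(String[] ngram) {
        StringBuilder context = new StringBuilder(ngram[0]);
        for (int i = 1; i < ngram.length - 1; i++)
            context.append(" ").append(ngram[i]);
        return new String[] { context.toString(), "gt:" + ngram[ngram.length - 1] };
    }

    // half ngram based: row = first half of the ngram, column label = second half
    static String[] halfNgramKey(String[] ngram) {
        int half = ngram.length / 2;
        StringBuilder row = new StringBuilder(ngram[0]);
        for (int i = 1; i < half; i++) row.append(" ").append(ngram[i]);
        StringBuilder label = new StringBuilder(ngram[half]);
        for (int i = half + 1; i < ngram.length; i++) label.append(" ").append(ngram[i]);
        return new String[] { row.toString(), "gt:" + label };
    }

    public static void main(String[] args) {
        String[] trigram = { "new", "york", "city" };
        System.out.println(java.util.Arrays.toString(wordBasedKey(trigram)));    // [city, gt:new york]
        System.out.println(java.util.Arrays.toString(contextBasedKey(trigram))); // [new york, gt:city]
        System.out.println(java.util.Arrays.toString(halfNgramKey(trigram)));    // [new, gt:york city]
    }
}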
The half ngram based structure is a trade-off between the word and context based structures. It moves some of the columns of the word based structure into new rows, so it has fewer columns than the word based structure and fewer rows than the context based structure. For lower ngram orders this trade-off brings no significant advantage; compared with the word and context based structures it costs a little more time for the extra string parsing, which reconstructs the intermediate keys into half ngrams. But it benefits from the more balanced table structure in terms of I/O cost when writing the table. Its time cost is not the lowest, but as we expand the cluster to more machines the difference in time cost will shrink, while the more compact table structure will hold more data and give us a better model.

The integer based structure turns out to be much less efficient than we expected. The idea of converting strings into integers comes from Google's method of storing unigram counts locally on every working node. Here we store this mapping in an Hbase table instead, with scalability to huge data sets in mind, but the experiments show that the cost of manipulating two tables in Hbase is too high even for trigram models. Still, we can obtain the approximate table size for this structure, and it does decrease as expected.

Looking at the trends, we expect the half ngram structure to perform even better for higher ngram orders and more training data; the context based structure is also a possible choice. However, many factors may affect the training time, e.g. network connection problems, hard disk failures and CPU load. The figures in our experiments are averages over several runs, and the training time differs between runs. Generally speaking we cannot avoid some deviation when measuring the time cost, but we can still see the trends as the ngram order increases. The space cost, on the other hand, is identical for every run, which means the model is stable and we can rely more on it for evaluation. The table size of the half ngram based structure is also smaller than the SRILM model data size. From this point of view, the half ngram structure is a proper choice for both time and space cost.

The caching query does not show a big improvement in time cost. This implies that keeping a cache in memory and checking both the memory and the Hbase table is not efficient in a MapReduce job. Because each working node only receives part of the whole data, storing information about previous data locally cannot guarantee that it will be useful for future data, so a node often wastes time searching the local HashMap, finds nothing, and still has to look up the Hbase table.
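The lookup pattern in question can be summarised by the sketch below, a simplified reading of the query logic in Appendix A.3; the Backend interface is a hypothetical stand-in for the real Hbase table.get(row, column) call.

import java.util.HashMap;

// Simplified sketch of the caching query: probe a per-node HashMap first,
// fall back to the Hbase table on a miss, then remember the answer locally.
public class CachingLookupSketch {

    interface Backend {
        Float queryHbase(String ngram);
    }

    private final HashMap<String, Float> cache = new HashMap<String, Float>();

    public Float probability(String ngram, Backend backend) {
        Float p = cache.get(ngram);           // local probe: frequently a miss,
        if (p != null) return p;              // since each node only sees part
        p = backend.queryHbase(ngram);        // of the test data
        if (p != null) cache.put(ngram, p);   // cache for possible reuse
        return p;
    }
}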
For the perplexity comparison, our model shows a better result than SRILM. This is related to implementation differences in the algorithm, e.g. the way unseen unigrams are handled and the way pruning is done. SRILM uses a more complex scheme for unseen unigrams: it adds 1 to the total count and redistributes probability across all existing unigrams. Doing this with MapReduce would require extra steps, and we think it would be too expensive to add another MapReduce job just to update the existing probabilities, so we choose a simpler scheme of adding 1 to the total and not redistributing. This works well when there are few unseen unigrams, but performs poorly on testing data that contains many of them.
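As a rough formalisation of the simpler scheme (this is our reading of the description above, not a formula quoted from the implementation), a seen unigram w with smoothed count c*(w) out of a total of N unigram tokens would receive

P(w) = c*(w) / (N + 1),

so the extra count in the denominator implicitly reserves a little probability mass for unseen unigrams, while SRILM, after adding 1 to the total, additionally redistributes probability over the existing unigrams.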
6.2 Future Works
There is a lot of work that can be done in the future. Hadoop and Hbase keep releasing newer versions with many bug fixes and enhancements, so the first thing we can do is to port our programs to the newer frameworks and see whether anything changes or improves. Many aspects may influence the final result, some less relevant and some quite important; we can try adjusting the Hadoop and Hbase configuration for better performance, e.g. the maximum number of map and reduce tasks, the buffer sizes and the data compression type. The experiments can also be extended to a larger cluster with more machines, increasing the computational power to reduce the time cost. It would be very interesting to compare the performance for higher ngram orders and more data, e.g. a 5-gram model trained on billions of words.

For the table structures themselves, further compression or encoding is possible. For the word or context based structures, many rows have only one column and share the same column value, so such rows could be merged into a single new row. For the integer based structure, storing the mapping file locally on each working node may be better than storing it in an Hbase table. Because of the different implementation of the Good-Turing algorithm, our model shows better probabilities, i.e. lower perplexity, than the SRILM tools; we could also implement more advanced smoothing methods such as Kneser-Ney on Hadoop and Hbase for comparison. More MapReduce tasks would be needed to compute the parameters, and these parameters would have to be stored either in HDFS or in the Hbase table along with the ngram probabilities, so the table structures may need further improvements, which leaves plenty of room for future research.
The output of this work can be applied in machine translation and other statistical language processing fields, providing an open source, distributed extension for such research and supporting large scale language models.
Bibliography
[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867, 2007.
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI '06, pages 205–218, 2006.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, pages 137–150, 2004.
[4] A. Emami, K. Papineni, and J. Sorensen. Large-scale distributed language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), 4:IV-37–IV-40, April 2007.
[5] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), 1:181–184, May 1995.
[6] A. Stolcke. SRILM - an extensible language modeling toolkit. In 7th International Conference on Spoken Language Processing, pages 901–904, September 2002.
[7] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 254–261, New York, NY, USA, 1999. ACM.
[8] Y. Zhang, A. S. Hildebrand, and S. Vogel. Distributed language modeling for n-best list re-ranking. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 216–223, Sydney, Australia, July 2006. Association for Computational Linguistics.
Appendix A
Source Code
A.1 NgramCount.java
NgramCount is the class to parse the input texts, generate ngram counts and count of counts. package ngram ; import j a v a . i o . I O E x c e p t i o n ; import j a v a . u t i l . A r r a y L i s t ; import j a v a . u t i l . I t e r a t o r ; import j a v a . u t i l . L i s t ; import j a v a . u t i l . r e g e x . P a t t e r n ; import j a v a . u t i l . r e g e x . M a t c h e r ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r a t i o n ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r e d ; import o r g . a p a c h e . hadoop . f s . F i l e S y s t e m ; import o r g . a p a c h e . hadoop . f s . P a t h ; import o r g . a p a c h e . hadoop . i o . I n t W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . L o n g W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . T e x t ; import o r g . a p a c h e . hadoop . i o . S e q u e n c e F i l e ; import o r g . a p a c h e . hadoop . i o . S e q u e n c e F i l e . C o m p r e s s i o n T y p e ; import o r g . a p a c h e . hadoop . mapred . J o b C l i e n t ; import o r g . a p a c h e . hadoop . mapred . R u n n i n g J o b ; 49
import o r g . a p a c h e . hadoop . mapred . J o b C o n f ; import o r g . a p a c h e . hadoop . mapred . MapReduceBase ; import o r g . a p a c h e . hadoop . mapred . Mapper ; import o r g . a p a c h e . hadoop . mapred . O u t p u t C o l l e c t o r ; import o r g . a p a c h e . hadoop . mapred . P a r t i t i o n e r ; import o r g . a p a c h e . hadoop . mapred . R e d u c e r ; import o r g . a p a c h e . hadoop . mapred . R e p o r t e r ; import o r g . a p a c h e . hadoop . mapred . C o u n t e r s ; import o r g . a p a c h e . hadoop . mapred . S e q u e n c e F i l e I n p u t F o r m a t ; import o r g . a p a c h e . hadoop . mapred . S e q u e n c e F i l e O u t p u t F o r m a t ; import o r g . a p a c h e . hadoop . u t i l . T o o l ; import o r g . a p a c h e . hadoop . u t i l . T o o l R u n n e r ; /∗ ∗ ∗ parse the t e x t
f i l e s t o c o u n t a l l t h e ngrams and
∗ e s t i m a t e c o u n t o f c o u n t s f o r Good−T u r i n g s m o o t h i n g ∗ ∗ @author X i a o y a n g Yu ∗ ∗/ p u b l i c c l a s s NgramCount e x t e n d s C o n f i g u r e d implements Tool { p r o t e c t e d f i n a l s t a t i c S t r i n g GTPARA = ” g t −p a r a m e t e r ” ; p r o t e c t e d f i n a l s t a t i c S t r i n g RAWCOUNTDIR = ” raw−c o u n t s ”; / / u s e t h e c o u n t e r t o c o l l e c t t h e t o t a l number o f unigrams p r o t e c t e d s t a t i c enum MyCounter { INPUT WORDS }; /∗ ∗ ∗ t h e map c l a s s t o p a r s e e a c h l i n e i n t o ngrams ∗
∗/ p u b l i c s t a t i c c l a s s countMap e x t e n d s MapReduceBase implements Mapper { private int order ; p r i v a t e S t r i n g words ; p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = new IntWritable (1) ; //
r e t r i e v e ngram o r d e r f r o m j o b c o n f i g u r a t i o n
p u b l i c void c o n f i g u r e ( JobConf job ) { order = job . g e t I n t ( ” order ” , 3) ; } / / e m i t a l l t h e unigram , b i g r a m . . up t o ngram p u b l i c v o i d map ( L o n g W r i t a b l e key , T e x t v a l u e , O u t p u t C o l l e c t o r o u t p u t , Reporter reporter ) throws I O E x c e p t i o n { String line = value . toString () ; i f ( l i n e == n u l l | | l i n e . t r i m ( ) . l e n g t h ( ) == 0 ) return ; P a t t e r n m y P a t t e r n = P a t t e r n . c o m p i l e ( ” \\w+ | \ \ p{ P u n c t }+ ” ) ; M a t c h e r myMatcher = m y P a t t e r n . m a t c h e r ( l i n e ) ; S t r i n g B u f f e r s b = new S t r i n g B u f f e r ( ) ; w h i l e ( myMatcher . f i n d ( ) ) { s b . a p p e n d ( myMatcher . g r o u p ( ) ) . a p p e n d ( ” \ t ” ) ; } l i n e = sb . t o S t r i n g ( ) ; S t r i n g [ ] c u r r e n t = l i n e . s p l i t ( ” \\ s +” ) ; int i = 0; i f ( order > 1) c u r r e n t = ( ”<s> ” + l i n e ) . s p l i t ( ” \\ s +” ) ; words = ” ” ;
f o r ( i = 1 ; i <= o r d e r ; i ++) { f o r ( i n t k = 0 ; k <= c u r r e n t . l e n g t h − i ; k ++) { words = c u r r e n t [ k ] ; i f ( words . l e n g t h ( ) ! = 0 ) { f o r ( i n t j = k + 1 ; j < k + i ; j ++) words = words + ” ” + c u r r e n t [ j ] ; o u t p u t . c o l l e c t ( new T e x t ( words ) , one ) ; } } } } } /∗ ∗ ∗ A c o m b i n e r c l a s s t h a t j u s t e m i t s t h e sum o f t h e ∗ input values . ∗/ p u b l i c s t a t i c c l a s s c o u nt C o m b i ne e x t e n d s MapReduceBase implements Reducer { p u b l i c v o i d r e d u c e ( T e x t key , I t e r a t o r values , O u t p u t C o l l e c t o r o u t p u t , Reporter reporter ) throws I O E x c e p t i o n { i n t sum = 0 ; while ( val ues . hasNext ( ) ) { sum += v a l u e s . n e x t ( ) . g e t ( ) ; } o u t p u t . c o l l e c t ( key , new I n t W r i t a b l e ( sum ) ) ; } } /∗ ∗ ∗ A r e d u c e r c l a s s t h a t j u s t e m i t s t h e sum o f t h e i n p u t
∗ v a l u e s , a l s o do t h e c o u n t s p r u n i n g , c o l l e c t t h e ∗ t o t a l number o f u n i g r a m s . ∗/ p u b l i c s t a t i c c l a s s c o u n t R e d u c e e x t e n d s MapReduceBase implements Reducer { private i n t prune ; p u b l i c void c o n f i g u r e ( JobConf job ) { prune = job . g e t I n t ( ” prune ” , 0) ; } p u b l i c v o i d r e d u c e ( T e x t key , I t e r a t o r values , O u t p u t C o l l e c t o r o u t p u t , Reporter reporter ) throws I O E x c e p t i o n { i n t sum = 0 ; while ( val ues . hasNext ( ) ) { sum += v a l u e s . n e x t ( ) . g e t ( ) ; } i f ( sum > p r u n e ) { o u t p u t . c o l l e c t ( key , new I n t W r i t a b l e ( sum ) ) ; /∗ ∗ ∗ c a l c u l a t e t h e t o t a l number , o n l y c o u n t s ∗ unigram , i g n o r e <s> ∗/ S t r i n g [ ] c u r r e n t = key . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) ; i f ( c u r r e n t . l e n g t h == 1 && ! c u r r e n t [ 0 ] . e q u a l s ( ”<s >” ) ) r e p o r t e r . i n c r C o u n t e r ( MyCounter . INPUT WORDS , sum ); } }
} /∗ ∗ ∗ a p a r t i t i o n c l a s s t o b a l a n c e t h e l o a d b a s e d on f i r s t ∗ two words . ∗ ∗/ public s t a t i c c l a s s c o u n t P a r t i t i o n extends MapReduceBase implements P a r t i t i o n e r { p u b l i c i n t g e t P a r t i t i o n ( T e x t key , I n t W r i t a b l e v a l u e , int numPartitions ) { S t r i n g [ ] l i n e = key . t o S t r i n g ( ) . s p l i t ( ” ” ) ; S t r i n g p r e f i x = ( l i n e . length > 1) ? ( l i n e [ 0] + l i n e [1]) : line [0]; r e t u r n p r e f i x . hashCode ( ) % n u m P a r t i t i o n s ; } } /∗ ∗ ∗ t h e s e c o n d t a s k i s t o c o u n t t h e c o u n t −o f −c o u n t s f o r ∗ Good−T u r i n g s m o o t h i n g . ∗ ∗ a map c l a s s t o p a r s e t h e raw c o u n t s and e m i t c o u n t ∗ o f c o u n t s w i t h t h e ngram o r d e r ∗/ p u b l i c s t a t i c c l a s s MapB e x t e n d s MapReduceBase implements Mapper { p r i v a t e f i n a l s t a t i c I n t W r i t a b l e one = new IntWritable (1) ; p u b l i c v o i d map ( T e x t key , I n t W r i t a b l e v a l u e , O u t p u t C o l l e c t o r o u t p u t ,
R e p o r t e r r e p o r t e r ) throws I O E x c e p t i o n { i f ( ! key . t o S t r i n g ( ) . e q u a l s ( ”<s>” ) ) { S t r i n g [ ] r e s u l t = key . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) ; S t r i n g combine = r e s u l t . l e n g t h + v a l u e . t o S t r i n g ( ) ; o u t p u t . c o l l e c t ( new I n t W r i t a b l e ( I n t e g e r . p a r s e I n t ( combine ) ) , one ) ; } } } /∗ ∗ ∗ a r e d u c e c l a s s t o e m i t t h e sums o f i n p u t v a l u e s ∗ ∗/ p u b l i c s t a t i c c l a s s ReduceB e x t e n d s MapReduceBase implements Reducer { p u b l i c v o i d r e d u c e ( I n t W r i t a b l e key , I t e r a t o r < IntWritable > values , O u t p u t C o l l e c t o r o u t p u t , R e p o r t e r r e p o r t e r ) throws I O E x c e p t i o n { i n t sum = 0 ; while ( val ues . hasNext ( ) ) { sum += v a l u e s . n e x t ( ) . g e t ( ) ; } o u t p u t . c o l l e c t ( key , new I n t W r i t a b l e ( sum ) ) ; } } static int printUsage () { System . o u t . p r i n t l n ( ” u s a g e : [ OPTIONS ] < i n p u t d i r >”
+ ”\ nOptions include : ” ) ; System . o u t . p r i n t l n ( ”−o r d e r :\ t t h e max ngram o r d e r \n \ t d e f a u l t :3\ n” + ”−p r u n e :\ t t h e p r u n i n g c o u n t s \ n\ t d e f a u l t :0\ n” ) ; r e t u r n −1; } p u b l i c i n t r u n ( S t r i n g [ ] a r g s ) throws E x c e p t i o n { int order = 3; i n t prune = 0; L i s t <S t r i n g > o t h e r a r g s = new A r r a y L i s t <S t r i n g >() ; f o r ( i n t i = 0 ; i < a r g s . l e n g t h ; ++ i ) { try { i f ( ”−o r d e r ” . e q u a l s ( a r g s [ i ] ) ) { o r d e r = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ; System . o u t . p r i n t l n ( ” t h e o r d e r i s s e t t o : ” + args [ i ]) ; } e l s e i f ( ”−p r u n e ” . e q u a l s ( a r g s [ i ] ) ) { p r u n e = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ; System . o u t . p r i n t l n ( ” u s e c o u n t p r u n i n g : ” + a r g s [ i ]) ; } else { o t h e r a r g s . add ( a r g s [ i ] ) ; } } catch ( NumberFormatException e x c e p t ) { System . o u t . p r i n t l n ( ”ERROR : I n t e g e r e x p e c t e d i n s t e a d of ” + args [ i ]) ; return printUsage ( ) ; } catch ( ArrayIndexOutOfBoundsException except ) { System . o u t . p r i n t l n ( ”ERROR : R e q u i r e d p a r a m e t e r m i s s i n g from ” + args [ i − 1]) ;
return printUsage ( ) ; } } / / Make s u r e t h e r e a r e e x a c t l y 1 p a r a m e t e r s l e f t . i f ( o t h e r a r g s . s i z e ( ) != 1) { System . o u t . p r i n t l n ( ”ERROR : Wrong number o f parameters : ” + args . length + ” i n s t e a d of 1. ” ) ; return printUsage ( ) ; } J o b C o n f jobA = new J o b C o n f ( g e t C o n f ( ) , NgramCount . class ) ; jobA . s e t J o b N a m e ( ” n g r a m c o u n t ” ) ; / / t h e k e y s a r e words ( s t r i n g s ) jobA . s e t O u t p u t K e y C l a s s ( T e x t . c l a s s ) ; / / the values are counts ( i n t s ) jobA . s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e . c l a s s ) ; jobA . s e t M a p p e r C l a s s ( countMap . c l a s s ) ; jobA . s e t C o m b i n e r C l a s s ( c o u nt C o m b in e . c l a s s ) ; jobA . s e t P a r t i t i o n e r C l a s s ( c o u n t P a r t i t i o n . c l a s s ) ; jobA . s e t R e d u c e r C l a s s ( c o u n t R e d u c e . c l a s s ) ; jobA . s e t I n t ( ” o r d e r ” , o r d e r ) ; jobA . s e t I n t ( ” p r u n e ” , p r u n e ) ; jobA . s e t I n p u t P a t h ( new P a t h ( o t h e r a r g s . g e t ( 0 ) ) ) ; jobA . s e t O u t p u t F o r m a t ( S e q u e n c e F i l e O u t p u t F o r m a t . c l a s s ) ; jobA . s e t O u t p u t P a t h ( new P a t h (RAWCOUNTDIR) ) ; i f ( F i l e S y s t e m . g e t ( jobA ) . e x i s t s ( new P a t h (RAWCOUNTDIR) )) F i l e S y s t e m . g e t ( jobA ) . d e l e t e ( new P a t h (RAWCOUNTDIR) ) ; S e q u e n c e F i l e O u t p u t F o r m a t . s e t C o m p r e s s O u t p u t ( jobA , t r u e );
SequenceFileOutputFormat . setOutputCompressionType ( jobA , C o m p r e s s i o n T y p e . BLOCK) ; /∗ ∗ ∗ record the running time ∗/ l o n g t 1 = System . c u r r e n t T i m e M i l l i s ( ) ; R u n n i n g J o b r j = J o b C l i e n t . r u n J o b ( jobA ) ; Counters cs = r j . getCounters ( ) ; try { P a t h o u t F i l e = new P a t h ( ”num−of−u n i g r a m s ” ) ; i f ( F i l e S y s t e m . g e t ( jobA ) . e x i s t s ( o u t F i l e ) ) F i l e S y s t e m . g e t ( jobA ) . d e l e t e ( o u t F i l e ) ; SequenceFile . Writer writer = SequenceFile . c r e a t e W r i t e r ( F i l e S y s t e m . g e t ( jobA ) , jobA , o u t F i l e , Text . class , LongWritable . c l a s s ) ; System . o u t . p r i n t l n ( ” t o t a l i s : ”+ c s . g e t C o u n t e r ( MyCounter . INPUT WORDS ) ) ; w r i t e r . a p p e n d ( new T e x t ( ” t o t a l ” ) , new L o n g W r i t a b l e ( c s . g e t C o u n t e r ( MyCounter . INPUT WORDS ) ) ) ; writer . close () ; } catch ( IOException e ) { System . o u t . p r i n t l n ( ” Can ’ t w r i t e f i l e : num−of− u n i g r a m s f o r t h e ngram model . E x i t . ” ) ; r e t u r n −1; } System . o u t . p r i n t l n ( ” J o b 1 f i n i s h e d . Now im s t a r t i n g job 2” ) ; /∗ ∗ ∗ t h e s e c o n d t a s k i s t o c o u n t t h e c o u n t −o f −c o u n t s ∗ f o r Good−T u r i n g s m o o t h i n g ∗ ∗/ J o b C o n f jobB = new J o b C o n f ( g e t C o n f ( ) , NgramCount . class ) ;
jobB . s e t I n t ( ” o r d e r ” , o r d e r ) ; jobB . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t . c l a s s ) ; jobB . s e t O u t p u t F o r m a t ( S e q u e n c e F i l e O u t p u t F o r m a t . c l a s s ) ; jobB . s e t O u t p u t K e y C l a s s ( I n t W r i t a b l e . c l a s s ) ; jobB . s e t O u t p u t V a l u e C l a s s ( I n t W r i t a b l e . c l a s s ) ; jobB . s e t M a p p e r C l a s s ( MapB . c l a s s ) ; jobB . s e t C o m b i n e r C l a s s ( ReduceB . c l a s s ) ; jobB . s e t R e d u c e r C l a s s ( ReduceB . c l a s s ) ; jobB . setNumReduceTasks ( 1 ) ; / / i n p u t i s t h e raw c o u n t s jobB . s e t I n p u t P a t h ( new P a t h (RAWCOUNTDIR) ) ; / / output i s count of counts P a t h t m p D i r = new P a t h (GTPARA) ; jobB . s e t O u t p u t P a t h ( t m p D i r ) ; i f ( F i l e S y s t e m . g e t ( jobB ) . e x i s t s ( t m p D i r ) ) F i l e S y s t e m . g e t ( jobB ) . d e l e t e ( t m p D i r ) ; S e q u e n c e F i l e O u t p u t F o r m a t . s e t C o m p r e s s O u t p u t ( jobB , t r u e ); SequenceFileOutputFormat . setOutputCompressionType ( jobB , C o m p r e s s i o n T y p e . BLOCK) ; J o b C l i e n t . r u n J o b ( jobB ) ; System . o u t . p r i n t l n ( ” J o b 2 f i n i s h e d ! Now you c a n s t a r t hbase t a b l e g e n e r a t o r . ” ) ; l o n g t 2 = System . c u r r e n t T i m e M i l l i s ( ) ; long sec = ( t2 − t1 ) / 1000; / / seconds System . o u t . p r i n t l n ( ” r u n n i n g t i m e i s : ” + s e c / 60 + ” m i n u t e s ” + s e c% 60 + ” s e c o n d s ” ) ; return 0; } p u b l i c s t a t i c v o i d main ( S t r i n g [ ] a r g s ) throws E x c e p t i o n { i n t e r r C o d e = T o o l R u n n e r . r u n ( new C o n f i g u r a t i o n ( ) , new NgramCount ( ) , args ) ;
System . e x i t ( e r r C o d e ) ; } }
A.2 TableGenerator.java
TableGenerator is the class to estimate ngram probability and store it in different Hbase tables. package ngram ; import j a v a . i o . I O E x c e p t i o n ; import j a v a . u t i l . A r r a y L i s t ; import j a v a . u t i l . HashMap ; import j a v a . u t i l . I t e r a t o r ; import j a v a . u t i l . L i s t ; import j a v a . u t i l . Map . E n t r y ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r a t i o n ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r e d ; import o r g . a p a c h e . hadoop . f s . F i l e S y s t e m ; import o r g . a p a c h e . hadoop . f s . P a t h ; import o r g . a p a c h e . hadoop . h b a s e . H B a s e C o n f i g u r a t i o n ; import o r g . a p a c h e . hadoop . h b a s e . HTable ; import o r g . a p a c h e . hadoop . h b a s e . i o . I m m u t a b l e B y t e s W r i t a b l e ; import o r g . a p a c h e . hadoop . h b a s e . mapred . T a b l e R e d u c e ; import o r g . a p a c h e . hadoop . i o . I n t W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . L o n g W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . M a p W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . S e q u e n c e F i l e ; import o r g . a p a c h e . hadoop . i o . T e x t ; import o r g . a p a c h e . hadoop . mapred . J o b C l i e n t ; import o r g . a p a c h e . hadoop . mapred . J o b C o n f ; import o r g . a p a c h e . hadoop . mapred . MapReduceBase ; import o r g . a p a c h e . hadoop . mapred . Mapper ;
import o r g . a p a c h e . hadoop . mapred . O u t p u t C o l l e c t o r ; import o r g . a p a c h e . hadoop . mapred . R e p o r t e r ; import o r g . a p a c h e . hadoop . mapred . S e q u e n c e F i l e I n p u t F o r m a t ; import o r g . a p a c h e . hadoop . u t i l . T o o l ; import o r g . a p a c h e . hadoop . u t i l . T o o l R u n n e r ; /∗ ∗ ∗ t h e t h i r d t a s k i s t o c a l c u l a t e t h e Good−T u r i n g ∗ smoothing counts for t r a i n i n g set , e s t i m a t e ∗ t h e c o n d i t i o n a l p r o b a b i l i t y and s t o r e i t i n Hbase ∗ table ∗ ∗ @author X i a o y a n g Yu ∗/ p u b l i c c l a s s T a b l e G e n e r a t o r e x t e n d s C o n f i g u r e d implements Tool { p r o t e c t e d f i n a l s t a t i c S t r i n g GTPARA = ” g t −p a r a m e t e r ” ; p r o t e c t e d f i n a l s t a t i c S t r i n g RAWCOUNTDIR = ” raw−c o u n t s ”; /∗ ∗ ∗ a map c l a s s t o r e a d i n raw c o u n t s and c o u n t o f ∗ c o u n t s , c o m p u t e t h e Good−T u r i n g s m o o t h i n g c o u n t s ∗ ∗/ p u b l i c s t a t i c c l a s s e s t i m a t e M a p e x t e n d s MapReduceBase implements Mapper { int order ; HashMap> c o u n t H a s h ; p u b l i c void c o n f i g u r e ( JobConf job ) { order = job . g e t I n t ( ” order ” , 3) ;
c o u n t H a s h = new HashMap>() ; P a t h t m p F i l e = new P a t h ( new P a t h (GTPARA) , ” p a r t −00000 ” ) ; / / open t h e c o u n t o f c o u n t s f i l e t o r e a d a l l c o u n t / / o f c o u n t s i n t o hashmap try { FileSystem f i l e S y s = FileSystem . get ( job ) ; S e q u e n c e F i l e . R e a d e r r e a d e r = new S e q u e n c e F i l e . Reader ( f i l e S y s , tmpFile , job ) ; I n t W r i t a b l e k = new I n t W r i t a b l e ( ) , v = new IntWritable () ; while ( r e a d e r . next ( k , v ) ) { Integer order = Integer . parseInt (k . toString () . substring (0 ,1) ) ; I n t e g e r key = I n t e g e r . p a r s e I n t ( k . t o S t r i n g ( ) . substring (1) ) ; i f ( countHash . containsKey ( o r d e r ) ) { HashMap tmp = c o u n t H a s h . g e t ( order ) ; tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ; c o u n t H a s h . p u t ( o r d e r , tmp ) ; } else { HashMap tmp = new HashMap< I n t e g e r , I n t e g e r >() ; tmp . p u t ( key , new I n t e g e r ( v . g e t ( ) ) ) ; c o u n t H a s h . p u t ( o r d e r , tmp ) ; } } reader . close () ; } catch ( IOException e ) { System . o u t . p r i n t l n ( ” c a n t open p a r a m e t e r f i l e ! ” ) ; } ; }
p u b l i c v o i d map ( T e x t key , I n t W r i t a b l e v a l u e , O u t p u t C o l l e c t o r o u t p u t , R e p o r t e r r ) throws I O E x c e p t i o n { Integer v = value . get () ; float c = v . floatValue () ; S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) ; I n t e g e r o r d e r = words . l e n g t h ; i f ( order > this . order ) return ; /∗ ∗ ∗ r ∗=( r +1) ∗Nr +1/ Nr t h e c o u n t o f c o u n t s Nr m u s t be ∗ e x i s t e d , o t h e r w i s e we u s e t h e c l o s e s t c o u n t s ∗ instead ∗/ f o r ( i n t i = 1 ; i < 5 ; i ++) { i f ( countHash . get ( o r d e r ) . containsKey ( v ) && c o u n t H a s h . g e t ( o r d e r ) . c o n t a i n s K e y ( v + i ) ) { I n t e g e r Nr1 = c o u n t H a s h . g e t ( o r d e r ) . g e t ( v + i ) ; I n t e g e r Nr = c o u n t H a s h . g e t ( o r d e r ) . g e t ( v ) ; c = ( v . f l o a t V a l u e ( ) + 1 f ) ∗ Nr1 . f l o a t V a l u e ( ) / Nr . f l o a t V a l u e ( ) ; break ; } } /∗ ∗ ∗ t h e G−T s m o o t h e d c o u n t s w i t h T e x t k e y i s c , t h e n ∗ we e m i t t h e t o k e n b a s e d on t h e l a s t n−1 gram a s ∗ t h e new k e y ∗/ i f ( order > 1) { S t r i n g s u f f i x = words [ o r d e r − 1 ] ; S t r i n g word = words [ 0 ] ; f o r ( i n t i = 1 ; i < o r d e r − 1 ; i ++) { word = word + ” ” + words [ i ] ;
} o u t p u t . c o l l e c t ( new T e x t ( word ) , new T e x t ( ” \ t ” + s u f f i x + ”\ t ” + Float . t o S t r i n g ( c ) ) ) ; } i f ( o r d e r == 1 | | ( o r d e r > 1 && o r d e r < t h i s . o r d e r ) ) o u t p u t . c o l l e c t ( key , new T e x t ( F l o a t . t o S t r i n g ( c ) ) ) ; } } /∗ ∗ ∗ a map c l a s s n e e d e d by i n t e g e r b a s e d t a b l e s t r u c t u r e , ∗ c o n v e r t the unigrams i n t o unique i n t e g e r s ∗ ∗/ p u b l i c s t a t i c c l a s s c o n v e r t M a p e x t e n d s MapReduceBase implements Mapper { Long i d = 0 l ; /∗ ∗ ∗ t h e i n p u t k e y i s ngram , v a l u e i s t h e raw c o u n t s ∗ t h e o u t p u t k e y i s unigram , v a l u e i s t h e u n i q u e ∗ integer id ∗/ p u b l i c v o i d map ( T e x t key , I n t W r i t a b l e v a l u e , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { S t r i n g [ ] words = key . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) ; i f ( words . l e n g t h == 1 ) { o u t p u t . c o l l e c t ( key , new L o n g W r i t a b l e (++ i d ) ) ; } } }
/∗ ∗ ∗ a r e d u c e c l a s s n e e d e d by i n t e g e r b a s e d t a b l e ∗ s t r u c t u r e , s t o r e t h e unigram−i n t e r g e r mapping i n ∗ Hbase t a b l e ∗ ∗/ p u b l i c s t a t i c c l a s s c o n v e r t T a b l e extends TableReduce< Text , L o n g W r i t a b l e > { / / t h e row k e y i s t h e u n i g r a m T e x t k , t h e column i s convert : id p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Long i d = v . n e x t ( ) . g e t ( ) ; M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; T e x t column = new T e x t ( ” c o n v e r t : i d ” ) ; mw. p u t ( column , new I m m u t a b l e B y t e s W r i t a b l e ( Long . toString ( id ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( k , mw) ; } } /∗ ∗ ∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n ngram b a s e d ∗ table ∗ ∗/ public s t a t i c c l a s s simpleTableReduce extends T a b l e R e d u c e { Long t o t a l ; p u b l i c void c o n f i g u r e ( JobConf job ) {
t o t a l = job . getLong ( ” t o t a l ” , 1) ; } p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Float base = 1 f ; HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g , F l o a t >() ; while ( v . hasNext ( ) ) { String record = v . next ( ) . t oStri ng ( ) ; i f ( record . s t a r t s W i t h ( ”\ t ” ) ) { S t r i n g [ ] tmp = r e c o r d . s p l i t ( ” \ t ” ) ; c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ; } else base = Float . p a r s e F l o a t ( record ) ; } M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; /∗ ∗ ∗ w r i t e t o t h e column , t h e row i s t h e ∗ ngram ∗/ i f ( ! c o u n t s . isEmpty ( ) ) { f o r ( E n t r y <S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t () ) { F l o a t prob = e n t r y . getValue ( ) / base ; i f ( prob > 1 f ) prob = 1 f ; mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : p r o b ” ) , new ImmutableBytesWritable ( F l o a t . t o S t r i n g ( prob ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( new T e x t ( k + ” ” + e n t r y . g e t K e y ( ) ) , mw) ;
} } i f ( k . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) . l e n g t h == 1 ) { mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : p r o b ” ) , new ImmutableBytesWritable ( Float . t o S t r i n g ( base / t o t a l ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( k , mw) ; } } } /∗ ∗ ∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n word b a s e d ∗ table ∗ ∗/ p u b l i c s t a t i c c l a s s wordTableReduce extends TableReduce { Long t o t a l ; p u b l i c void c o n f i g u r e ( JobConf job ) { t o t a l = job . getLong ( ” t o t a l ” , 1) ; } /∗ ∗ ∗ k i s t h e n−1 gram c o n t e x t ∗/ p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Float base = 1 f ; HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g , F l o a t >() ;
while ( v . hasNext ( ) ) { String record = v . next ( ) . t oStri ng ( ) ; i f ( record . s t a r t s W i t h ( ”\ t ” ) ) { S t r i n g [ ] tmp = r e c o r d . s p l i t ( ” \ t ” ) ; c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ; } else base = Float . p a r s e F l o a t ( record ) ; } M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; /∗ ∗ ∗ w r i t e t o t h e column f a m i l y , column l a b e l ∗ i s t h e n−1 gram T e x t k , t h e row i s t h e word f r o m ∗ HashMap c o u n t s ∗/ i f ( ! c o u n t s . isEmpty ( ) ) { f o r ( E n t r y <S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t () ) { F l o a t prob = e n t r y . getValue ( ) / base ; i f ( prob > 1 f ) prob = 1 f ; mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : ” + k ) , new ImmutableBytesWritable ( F l o a t . t o S t r i n g ( prob ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( new T e x t ( e n t r y . g e t K e y ( ) ) , mw) ; } } i f ( k . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) . l e n g t h == 1 ) { mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : u n i g r a m ” ) , new ImmutableBytesWritable ( Float . t o S t r i n g ( base / t o t a l ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( k , mw) ; }
} } /∗ ∗ ∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n c o n t e x t b a s e d table ∗ ∗/ public s t a t i c c l a s s contextTableReduce extends T a b l e R e d u c e { Long t o t a l ; @Override p u b l i c void c o n f i g u r e ( JobConf job ) { t o t a l = job . getLong ( ” t o t a l ” , 1) ; } /∗ ∗ ∗ k i s t h e n−1 gram c o n t e x t ∗/ p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Float base = 1 f ; HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g , F l o a t >() ; while ( v . hasNext ( ) ) { String record = v . next ( ) . t oStri ng ( ) ; i f ( record . s t a r t s W i t h ( ”\ t ” ) ) { S t r i n g [ ] tmp = r e c o r d . s p l i t ( ” \ t ” ) ; c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ; } else base = Float . p a r s e F l o a t ( record ) ; }
M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; /∗ ∗ ∗ w r i t e t o t h e column f a m i l y , column l a b e l i s t h e word f r o m ∗ HashMap c o u n t s , t h e row i s t h e n−1 gram T e x t k ∗/ i f ( ! c o u n t s . isEmpty ( ) ) { f o r ( E n t r y <S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t () ) { F l o a t prob = e n t r y . getValue ( ) / base ; i f ( prob > 1 f ) prob = 1 f ; mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : ” + e n t r y . g e t K e y ( ) ) , new I m m u t a b l e B y t e s W r i t a b l e ( F l o a t . t o S t r i n g ( prob ) . getBytes () ) ) ; o u t p u t . c o l l e c t ( k , mw) ; } } i f ( k . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) . l e n g t h == 1 ) { mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : u n i g r a m ” ) , new ImmutableBytesWritable ( Float . t o S t r i n g ( base / t o t a l ) . getBytes ( ) ) ) ; o u t p u t . c o l l e c t ( k , mw) ; } } } /∗ ∗ ∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n h a l f ngram based t a b l e ∗
∗/ public s t a t i c c l a s s combineTableReduce extends T a b l e R e d u c e { Long t o t a l ; p u b l i c void c o n f i g u r e ( JobConf job ) { t o t a l = job . getLong ( ” t o t a l ” , 1) ; } /∗ ∗ ∗ k i s t h e n−1 gram c o n t e x t ∗/ p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Float base = 1 f ; HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g , F l o a t >() ; while ( v . hasNext ( ) ) { String record = v . next ( ) . t oStri ng ( ) ; i f ( record . s t a r t s W i t h ( ”\ t ” ) ) { S t r i n g [ ] tmp = r e c o r d . s p l i t ( ” \ t ” ) ; c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ; } else base = Float . p a r s e F l o a t ( record ) ; } M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; /∗ ∗ ∗ w r i t e t o t h e column f a m i l y , column l a b e l is the l a s t half ∗ ngram c o n t e x t , t h e row i s t h e f i r s t context ∗/
h a l f ngram
i f ( ! c o u n t s . isEmpty ( ) ) { f o r ( E n t r y <S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t () ) { F l o a t prob = e n t r y . getValue ( ) / base ; i f ( prob > 1 f ) prob = 1 f ; mw. c l e a r ( ) ; S t r i n g ngram = k . t o S t r i n g ( ) + ” ” + e n t r y . getKey ( ) ; S t r i n g [ ] a r r a y = ngram . s p l i t ( ” \\ s +” ) ; int half = array . length / 2; S t r i n g row = a r r a y [ 0 ] ; String label = array [ half ]; f o r ( i n t i = 1 ; i < h a l f ; i ++) { row = row + ” ” + a r r a y [ i ] ; } f o r ( i n t i = h a l f + 1 ; i < a r r a y . l e n g t h ; i ++) { label = label + ” ” + array [ i ]; } mw. p u t ( new T e x t ( ” g t : ” + l a b e l ) , new ImmutableBytesWritable ( F l o a t . t o S t r i n g ( prob ) . getBytes ( ) ) ) ; o u t p u t . c o l l e c t ( new T e x t ( row ) , mw) ; } } i f ( k . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) . l e n g t h == 1 ) { mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : u n i g r a m ” ) , new ImmutableBytesWritable ( Float . t o S t r i n g ( base / t o t a l ) . getBytes ( ) ) ) ; o u t p u t . c o l l e c t ( k , mw) ; } } }
/∗ ∗ ∗ e s t i m a t e and s t o r e t h e p r o b a b i l i t y i n i n t e g e r b a s e d table ∗ ∗/ public s t a t i c c l a s s integerTableReduce extends T a b l e R e d u c e { Long t o t a l ; private HBaseConfiguration config ; p r i v a t e HTable t a b l e ; p u b l i c void c o n f i g u r e ( JobConf job ) { t o t a l = job . getLong ( ” t o t a l ” , 1) ; try { c o n f i g = new H B a s e C o n f i g u r a t i o n ( j o b ) ; t a b l e = new HTable ( c o n f i g , new T e x t ( ” idmap ” ) ) ; } catch ( IOException e ) { System . o u t . p r i n t l n ( ” Can ’ t open t a b l e : idmap f o r t h e ngram model . ” ) ; return ; } } /∗ ∗ ∗ c o n v e r t t h e ngram s t r i n g i n t o l o n g i n t e g e r s . Not f o u n d words a r e ∗ r e t u r n e d as ’0 ’ ∗ ∗ @param ngram t h e ngram s t r i n g ∗ @return t h e i n t e g e r r e p r e s e n t a t i o n o f t h e s t r i n g ∗/ p r i v a t e S t r i n g c o n v e r t I D ( S t r i n g ngram ) throws IOException { S t r i n g [ ] a r r a y = ngram . s p l i t ( ” \\ s +” ) ;
T e x t Column = new T e x t ( ” c o n v e r t : i d ” ) ; S t r i n g myid = ” 0 ” ; b y t e [ ] v a l u e B y t e s = t a b l e . g e t ( new T e x t ( a r r a y [ 0 ] ) , Column ) ; i f ( v a l u e B y t e s != n u l l ) { myid = new S t r i n g ( v a l u e B y t e s ) ; } else return ”0” ; i f ( array . length > 1) { f o r ( i n t i = 1 ; i < a r r a y . l e n g t h ; i ++) { v a l u e B y t e s = t a b l e . g e t ( new T e x t ( a r r a y [ i ] ) , Column ) ; i f ( v a l u e B y t e s != n u l l ) { myid = myid + ” ” + new S t r i n g ( v a l u e B y t e s ) ; } else return ”0” ; } } r e t u r n myid ; } /∗ ∗ ∗ k i s t h e n−1 gram c o n t e x t ∗/ p u b l i c v o i d r e d u c e ( T e x t k , I t e r a t o r v , O u t p u t C o l l e c t o r o u t p u t , Reporter r ) throws I O E x c e p t i o n { Float base = 1 f ; HashMap<S t r i n g , F l o a t > c o u n t s = new HashMap<S t r i n g , F l o a t >() ; while ( v . hasNext ( ) ) { String record = v . next ( ) . t oStri ng ( ) ; i f ( record . s t a r t s W i t h ( ”\ t ” ) ) { S t r i n g [ ] tmp = r e c o r d . s p l i t ( ” \ t ” ) ;
c o u n t s . p u t ( tmp [ 1 ] , F l o a t . p a r s e F l o a t ( tmp [ 2 ] ) ) ; } else base = Float . p a r s e F l o a t ( record ) ; } M a p W r i t a b l e mw = new M a p W r i t a b l e ( ) ; /∗ ∗ ∗ w r i t e t o t h e column , t h e row i s t h e integer ∗ r e p r e s e n t a t i o n o f ngram ∗/ i f ( ! c o u n t s . isEmpty ( ) ) { f o r ( E n t r y <S t r i n g , F l o a t > e n t r y : c o u n t s . e n t r y S e t () ) { F l o a t prob = e n t r y . getValue ( ) / base ; i f ( prob > 1 f ) prob = 1 f ; mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : p r o b ” ) , new ImmutableBytesWritable ( F l o a t . t o S t r i n g ( prob ) . getBytes ( ) ) ) ; String id = convertID ( k . t o S t r i n g ( ) + ”\ t ” + e n t r y . getKey ( ) ) ; o u t p u t . c o l l e c t ( new T e x t ( i d ) , mw) ; } } i f ( k . t o S t r i n g ( ) . s p l i t ( ” \\ s +” ) . l e n g t h == 1 ) { mw. c l e a r ( ) ; mw. p u t ( new T e x t ( ” g t : p r o b ” ) , new ImmutableBytesWritable ( Float . t o S t r i n g ( base / t o t a l ) . getBytes ( ) ) ) ; String id = convertID ( k . t o S t r i n g ( ) ) ; o u t p u t . c o l l e c t ( new T e x t ( i d ) , mw) ; } }
} p u b l i c i n t r u n ( S t r i n g [ ] a r g s ) throws E x c e p t i o n { int order = 3; int type = 1; L i s t <S t r i n g > o t h e r a r g s = new A r r a y L i s t <S t r i n g >() ; f o r ( i n t i = 0 ; i < a r g s . l e n g t h ; ++ i ) { try { i f ( ”−o r d e r ” . e q u a l s ( a r g s [ i ] ) ) { o r d e r = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ; i f ( order < 1) { System . o u t . p r i n t l n ( ” order is invalid , using default instead ”) ; order = 3; } } e l s e i f ( ”−t y p e ” . e q u a l s ( a r g s [ i ] ) ) { t y p e = I n t e g e r . p a r s e I n t ( a r g s [++ i ] ) ; i f ( type < 1 | | type > 5) { System . o u t . p r i n t l n ( ” t y p e i s o u t o f r a n g e (1 −5) , using default instead ” ) ; type = 1; } } else { o t h e r a r g s . add ( a r g s [ i ] ) ; } } catch ( NumberFormatException e x c e p t ) { System . o u t . p r i n t l n ( ”ERROR : I n t e g e r e x p e c t e d i n s t e a d of ” + args [ i ]) ; return printUsage ( ) ; } catch ( ArrayIndexOutOfBoundsException except ) { System . o u t . p r i n t l n ( ”ERROR : R e q u i r e d p a r a m e t e r m i s s i n g from ”
+ args [ i − 1]) ; return printUsage ( ) ; } } / / Make s u r e t h e r e a r e e x a c t l y 1 p a r a m e t e r s l e f t . i f ( o t h e r a r g s . s i z e ( ) != 1) { System . o u t . p r i n t l n ( ”ERROR : Wrong number o f parameters : ” + args . length + ” i n s t e a d of 1. ” ) ; return printUsage ( ) ; } System . o u t . p r i n t l n ( ” t h e o r d e r i s s e t t o : ” + o r d e r ) ; System . o u t . p r i n t l n ( ” t h e t a b l e t y p e i s s e t t o : ” + t y p e ); /∗ ∗ ∗ f o r i n t e g e r b a s e d t a b l e , e x t r a MapReduce s t e p i s required to generate ∗ t h e c o n v e r t map ∗/ i f ( t y p e == 5 ) { J o b C o n f c o n v e r t J o b = new J o b C o n f ( g e t C o n f ( ) , TableGenerator . class ) ; c o n v e r t J o b . s e t J o b N a m e ( ” g e n e r a t e c o n v e r t mapping table ”) ; c o n v e r t J o b . setMapperClass ( convertMap . c l a s s ) ; convertJob . setInputFormat ( SequenceFileInputFormat . class ) ; c on ve rt Jo b . setMapOutputKeyClass ( Text . c l a s s ) ; convertJob . setMapOutputValueClass ( LongWritable . class ) ; c o n v e r t J o b . s e t I n p u t P a t h ( new P a t h (RAWCOUNTDIR) ) ; convertJob . setReducerClass ( convertTable . class ) ; //
i n i t t h e h b a s e t a b l e w i t h t h e c o n v e r t map t a b l e name ” idmap ”
T a b l e R e d u c e . i n i t J o b ( ” idmap ” , c o n v e r t T a b l e . c l a s s , convertJob ) ; /∗ ∗ ∗ record the running time ∗/ l o n g t 1 = System . c u r r e n t T i m e M i l l i s ( ) ; JobClient . runJob ( convertJob ) ; l o n g t 2 = System . c u r r e n t T i m e M i l l i s ( ) ; long sec = ( t2 − t1 ) / 1000; / / seconds System . o u t . p r i n t l n ( ” r u n n i n g t i m e i s : ” + s e c / 60 + ” minutes ” + s e c % 60 + ” s e c o n d s ” ) ; } /∗ ∗ ∗ Job 3 c a l c u l a t e s t h e a d j u s t e d Good−T u r i n g c o u n t s for t r a i n i n g data ∗ and w r i t e t h e s m o o t h e d c o u n t s t o Hbase t a b l e ∗/ J o b C o n f jobC = new J o b C o n f ( g e t C o n f ( ) , T a b l e G e n e r a t o r . class ) ; jobC . s e t J o b N a m e ( ” ngram w i t h h b a s e ” ) ; jobC . s e t I n t ( ” o r d e r ” , o r d e r ) ; jobC . s e t M a p p e r C l a s s ( e s t i m a t e M a p . c l a s s ) ; jobC . s e t I n p u t F o r m a t ( S e q u e n c e F i l e I n p u t F o r m a t . c l a s s ) ; jobC . s e t M a p O u t p u t K e y C l a s s ( T e x t . c l a s s ) ; jobC . s e t M a p O u t p u t V a l u e C l a s s ( T e x t . c l a s s ) ; jobC . s e t I n p u t P a t h ( new P a t h (RAWCOUNTDIR) ) ; switch ( type ) { case 1: jobC . s e t R e d u c e r C l a s s ( s i m p l e T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , simpleTableReduce . class , jobC ) ; break ;
case 2: jobC . s e t R e d u c e r C l a s s ( w o r d T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , w o r d T a b l e R e d u c e . c l a s s , jobC ) ; break ; case 3: jobC . s e t R e d u c e r C l a s s ( c o n t e x t T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , contextTableReduce . class , jobC ) ; break ; case 4: jobC . s e t R e d u c e r C l a s s ( c o m b i n e T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , combineTableReduce . c l a s s , jobC ) ; break ; case 5: jobC . s e t R e d u c e r C l a s s ( i n t e g e r T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , integerTableReduce . class , jobC ) ; break ; default : jobC . s e t R e d u c e r C l a s s ( s i m p l e T a b l e R e d u c e . c l a s s ) ; //
i n i t t h e h b a s e t a b l e w i t h s p e c i f i e d t a b l e name
TableReduce . i n i t J o b ( o t h e r a r g s . g e t ( 0 ) , simpleTableReduce . class , jobC ) ; break ; }
/ / open t h e num−o f −u n i g r a m s f i l e t o r e a d t h e t o t a l unigram c o u n t s try { Long t o t a l = 0L ; P a t h t m p F i l e = new P a t h ( ”num−of−u n i g r a m s ” ) ; F i l e S y s t e m f i l e S y s = F i l e S y s t e m . g e t ( jobC ) ; S e q u e n c e F i l e . R e a d e r r e a d e r = new S e q u e n c e F i l e . Reader ( f i l e S y s , t m p F i l e , jobC ) ; L o n g W r i t a b l e v a l u e = new L o n g W r i t a b l e ( ) ; i f ( r e a d e r . n e x t ( new T e x t ( ) , v a l u e ) ) { t o t a l = value . get () ; } reader . close () ; jobC . s e t L o n g ( ” t o t a l ” , t o t a l ) ; } catch ( IOException e ) { System . o u t . p r i n t l n ( ” Can ’ t open f i l e : num−of− u n i g r a m s f o r t h e ngram model . E x i t . ” ) ; r e t u r n −1; } /∗ ∗ ∗ record the running time ∗/ l o n g t 1 = System . c u r r e n t T i m e M i l l i s ( ) ; J o b C l i e n t . r u n J o b ( jobC ) ; l o n g t 2 = System . c u r r e n t T i m e M i l l i s ( ) ; long sec = ( t2 − t1 ) / 1000; / / seconds System . o u t . p r i n t l n ( ” r u n n i n g t i m e i s : ” + s e c / 60 + ” minutes ” + sec % 60 + ” s e c o n d s ” ) ; return 0; } static int printUsage () {
System . o u t . p r i n t l n ( ” u s a g e : [ OPTIONS ] < t a b l e name>” + ”\ nOptions include : ” ) ; System . o u t . p r i n t l n ( ”−o r d e r :\ t t h e max ngram o r d e r \ n\ t d e f a u l t : 3 \ n ” + ”−t y p e :\ t t h e t a b l e t y p e \ n\ t d e f a u l t :1\ n” ) ; System . o u t . p r i n t l n ( ” t h e t a b l e t y p e c a n be c h o s e n from :”) ; System . o u t . p r i n t l n ( ” 1 . ngram b a s e d \n ” + ” 2 . word based \n” + ” 3 . c o n t e x t b a s e d \ n ” + ” 4 . h a l f ngram b a s e d \n ” + ” 5. i n t e g e r based ” ) ; r e t u r n −1; } p u b l i c s t a t i c v o i d main ( S t r i n g [ ] a r g s ) throws E x c e p t i o n { i n t e r r C o d e = T o o l R u n n e r . r u n ( new C o n f i g u r a t i o n ( ) , new TableGenerator () , args ) ; System . e x i t ( e r r C o d e ) ; } }
A.3 NgramModel.java
NgramModel is the class to query probability from Hbase table for testing texts, and compute the perplexity. package ngram ; import j a v a . i o . I O E x c e p t i o n ; import j a v a . u t i l . A r r a y L i s t ; import j a v a . u t i l . HashMap ; import j a v a . u t i l . I t e r a t o r ;
import j a v a . u t i l . L i s t ; import j a v a . u t i l . r e g e x . M a t c h e r ; import j a v a . u t i l . r e g e x . P a t t e r n ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r a t i o n ; import o r g . a p a c h e . hadoop . c o n f . C o n f i g u r e d ; import o r g . a p a c h e . hadoop . f s . P a t h ; import o r g . a p a c h e . hadoop . f s . F i l e S y s t e m ; import o r g . a p a c h e . hadoop . i o . I n t W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . L o n g W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . F l o a t W r i t a b l e ; import o r g . a p a c h e . hadoop . i o . S e q u e n c e F i l e ; import o r g . a p a c h e . hadoop . i o . T e x t ; import o r g . a p a c h e . hadoop . i o . S e q u e n c e F i l e . C o m p r e s s i o n T y p e ; import o r g . a p a c h e . hadoop . mapred . J o b C l i e n t ; import o r g . a p a c h e . hadoop . mapred . J o b C o n f ; import o r g . a p a c h e . hadoop . mapred . MapReduceBase ; import o r g . a p a c h e . hadoop . mapred . Mapper ; import o r g . a p a c h e . hadoop . mapred . O u t p u t C o l l e c t o r ; import o r g . a p a c h e . hadoop . mapred . R e d u c e r ; import o r g . a p a c h e . hadoop . mapred . R e p o r t e r ; import o r g . a p a c h e . hadoop . mapred . T e x t O u t p u t F o r m a t ; import o r g . a p a c h e . hadoop . mapred . S e q u e n c e F i l e I n p u t F o r m a t ; import o r g . a p a c h e . hadoop . mapred . S e q u e n c e F i l e O u t p u t F o r m a t ; import o r g . a p a c h e . hadoop . u t i l . T o o l ; import o r g . a p a c h e . hadoop . u t i l . T o o l R u n n e r ; import o r g . a p a c h e . hadoop . h b a s e . HTable ; import o r g . a p a c h e . hadoop . h b a s e . H B a s e C o n f i g u r a t i o n ; /∗ ∗ ∗ t r y t o e v a l u a t e t h e ngram model , e s t i m a t e p e r p l e x i t y ∗ scores for testing sentences . ∗ ∗ @author X i a o y a n g Yu
∗ ∗/ p u b l i c c l a s s NgramModel e x t e n d s C o n f i g u r e d implements Tool , Mapper { p r i v a t e J o b C o n f myconf ; private HBaseConfiguration config ; p r i v a t e HTable t a b l e ; p r i v a t e HTable c o n v e r t T a b l e ; p r i v a t e HashMap<S t r i n g , F l o a t > mycache ; p r i v a t e i n t type , cache , o r d e r ; p r i v a t e Double p e r p l e x i t y = 0 . 0 , H = 0 . 0 ; p r i v a t e l o n g c o u n t = 0L , t o t a l = 0L ; /∗ ∗ ∗ a map c l a s s t o do t h e word c o u n t s ∗ ∗/ p u b l i c s t a t i c c l a s s myMap e x t e n d s MapReduceBase implements Mapper { private int order ; p r i v a t e S t r i n g words ; p u b l i c void c o n f i g u r e ( JobConf job ) { order = job . g e t I n t ( ” order ” , 3) ; } p u b l i c v o i d map ( L o n g W r i t a b l e key , T e x t v a l u e , O u t p u t C o l l e c t o r o u t p u t , Reporter reporter ) throws I O E x c e p t i o n { String line = value . toString () ; i f ( l i n e == n u l l | | l i n e . t r i m ( ) . l e n g t h ( ) == 0 ) return ;
P a t t e r n m y P a t t e r n = P a t t e r n . c o m p i l e ( ” \\w+ | \ \ p{ P u n c t }+ ” ) ; M a t c h e r myMatcher = m y P a t t e r n . m a t c h e r ( l i n e ) ; S t r i n g B u f f e r s b = new S t r i n g B u f f e r ( ) ; w h i l e ( myMatcher . f i n d ( ) ) { s b . a p p e n d ( myMatcher . g r o u p ( ) ) . a p p e n d ( ” \ t ” ) ; } l i n e = sb . t o S t r i n g ( ) ; S t r i n g [ ] c u r r e n t = l i n e . s p l i t ( ” \\ s +” ) ; words = ” ” ; i f ( order > 1) c u r r e n t = ( ”<s> ” + l i n e ) . s p l i t ( ” \\ s +” ) ; f o r ( i n t i = 0 ; i <= c u r r e n t . l e n g t h − o r d e r ; i ++) { words = c u r r e n t [ i ] ; f o r ( i n t j = i + 1 ; j < i + o r d e r ; j ++) words = words + ” ” + c u r r e n t [ j ] ; o u t p u t . c o l l e c t ( new T e x t ( words ) , new I n t W r i t a b l e (1) ) ; } } } /∗ ∗ ∗ A r e d u c e r c l a s s t h a t j u s t e m i t s t h e sum o f t h e i n p u t values . ∗/ p u b l i c s t a t i c c l a s s myReduce e x t e n d s MapReduceBase implements Reducer { p u b l i c v o i d r e d u c e ( T e x t key , I t e r a t o r values , O u t p u t C o l l e c t o r o u t p u t , Reporter reporter ) throws I O E x c e p t i o n { i n t sum = 0 ;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  /**
   * open the Hbase table for query
   */
  public void configure(JobConf job) {
    myconf = job;
    String tableName = job.get("table");
    type = job.getInt("type", 1);
    cache = job.getInt("cache", 0);
    total = job.getLong("total", 1);
    order = job.getInt("order", 3);
    mycache = new HashMap<String, Float>();
    try {
      config = new HBaseConfiguration(job);
      table = new HTable(config, new Text(tableName));
      convertTable = new HTable(config, new Text("idmap"));
    } catch (IOException e) {
      System.out.println("Can't open table: " + tableName
          + " for the ngram model. Exit.");
      return;
    }
  }

  /**
   * query the Hbase table for different table types to
   * get the probability
   *
   * @param ngram
   *          the ngram to query
   * @return the probability of the input ngram
   * @throws IOException
   */
  private byte[] queryTable(String ngram) throws IOException {
    String row, column;
    String[] array = ngram.split("\\s+");
    int length = array.length;
    if (length == 0)
      return null;
    switch (type) {
    // ngram based
    case 1:
      row = ngram;
      column = "gt:prob";
      break;
    // word based
    case 2:
      row = array[length - 1];
      if (length == 1)
        column = "gt:unigram";
      else {
        column = "gt:" + array[0];
        for (int i = 1; i < length - 1; i++) {
          column = column + " " + array[i];
        }
      }
      break;
    // context based
    case 3:
      row = array[0];
      for (int i = 1; i < length - 1; i++) {
        row = row + " " + array[i];
      }
      if (length == 1)
        column = "gt:unigram";
      else
        column = "gt:" + array[length - 1];
      break;
    // half ngram based
    case 4:
      if (length == 1) {
        row = array[0];
        column = "gt:unigram";
      } else {
        int half = array.length / 2;
        row = array[0];
        column = "gt:" + array[half];
        for (int i = 1; i < half; i++) {
          row = row + " " + array[i];
        }
        for (int i = half + 1; i < array.length; i++) {
          column = column + " " + array[i];
        }
      }
      break;
    // integer based
    case 5:
      byte[] valueBytes = convertTable.get(new Text(array[0]),
          new Text("convert:id"));
      if (valueBytes != null)
        row = new String(valueBytes);
      else
        return null;
      if (array.length > 1) {
        for (int i = 1; i < array.length; i++) {
          valueBytes = convertTable.get(new Text(array[i]),
              new Text("convert:id"));
          if (valueBytes != null) {
            row = row + " " + new String(valueBytes);
          } else
            return null;
        }
      }
      column = "gt:prob";
      break;
    default:
      row = ngram;
      column = "gt:prob";
      break;
    }
    return table.get(new Text(row), new Text(column));
  }

  public void map(Text key, IntWritable value,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    Float prob = 0f, alpha = 1f;
    /**
     * a back-off calculation to get the probability
     */
    boolean finished = false;
    String row = key.toString();
    while (finished == false) {
      if (cache == 1 && mycache.containsKey(row)) {
        prob = alpha * mycache.get(row);
        finished = true;
      } else {
        byte[] valueBytes = queryTable(row);
        if (valueBytes != null) {
          /**
           * find the probability
           */
          prob = alpha * Float.valueOf(new String(valueBytes));
          if (cache == 1 && row.split("\\s+").length < order) {
            if (mycache.size() > 100)
              mycache.clear();
            mycache.put(row, Float.valueOf(new String(valueBytes)));
          }
          finished = true;
        } else {
          String[] words = row.split("\\s+");
          /**
           * unseen unigram
           */
          if (words.length == 1) {
            prob = alpha / (total + 1f);
            finished = true;
          } else {
            /**
             * back-off to n-1 gram words with a fixed back-off weight of 0.4
             */
            row = words[1];
            for (int i = 2; i < words.length; i++)
              row = row + " " + words[i];
            alpha = alpha * 0.4f;
            finished = false;
          }
        }
      }
    }
    count += value.get();
    if (prob > 1f)
      prob = 1f;
    if (prob > 0f)
      H += value.get() * Math.log(prob.doubleValue()) / Math.log(2.0);
    output.collect(key, new FloatWritable(prob));
  }

  /**
   * compute the final perplexity score, write to a file
   */
  public void close() throws IOException {
    perplexity = Math.pow(2.0, -H / count);
    Path outFile = new Path("perplexity");
    FileSystem fileSys = FileSystem.get(myconf);
    if (fileSys.exists(outFile))
      fileSys.delete(outFile);
    SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, myconf,
        outFile, Text.class, FloatWritable.class, CompressionType.NONE);
    writer.append(new Text("perplexity is: "), new FloatWritable(
        perplexity.floatValue()));
    writer.close();
  }

  static int printUsage() {
    System.out.println("usage: [OPTIONS] <test input dir> <table name>"
        + "\nOptions include: ");
    System.out
        .println("-order:\tthe max ngram order\n\tdefault:3\n"
            + "-cache <0|1>:\tuse the caching query\n\tdefault:0\n"
            + "-type:\tthe table type\n\tdefault:1\n");
    System.out.println("the table type can be chosen from:");
    System.out.println("1. ngram based\n" + "2. word based\n"
        + "3. context based\n" + "4. combined word and context based\n"
        + "5. integer based");
    return -1;
  }

  public int run(String[] args) throws Exception {
    int order = 3;
    int type = 1;
    int cache = 0;
    JobConf jobA = new JobConf(getConf(), NgramModel.class);
    jobA.setJobName("raw count");
    List<String> otherargs = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-order".equals(args[i])) {
          order = Integer.parseInt(args[++i]);
        } else if ("-type".equals(args[i])) {
          type = Integer.parseInt(args[++i]);
        } else if ("-cache".equals(args[i])) {
          cache = Integer.parseInt(args[++i]);
        } else {
          otherargs.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from "
            + args[i - 1]);
        return printUsage();
      }
    }
    // Make sure there are exactly 2 parameters left.
    if (otherargs.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: "
          + otherargs.size() + " instead of 2.");
      return printUsage();
    }
    System.out.println("the order is set to: " + order);
    System.out.println("the table type is set to: " + type);
    System.out.println("use the caching query: "
        + ((cache == 1) ? "yes" : "no"));
    jobA.setInt("order", order);
    Path inputDir = new Path(otherargs.get(0));
    jobA.addInputPath(inputDir);
    jobA.setOutputFormat(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(jobA, true);
    SequenceFileOutputFormat.setOutputCompressionType(jobA,
        CompressionType.BLOCK);
    Path tmp = new Path("tmp");
    if (FileSystem.get(jobA).exists(tmp))
      FileSystem.get(jobA).delete(tmp);
    jobA.setOutputPath(tmp);
    jobA.setOutputKeyClass(Text.class);
    jobA.setOutputValueClass(IntWritable.class);
    jobA.setMapperClass(myMap.class);
    jobA.setCombinerClass(myReduce.class);
    jobA.setReducerClass(myReduce.class);
    /**
     * record the running time
     */
    long t1 = System.currentTimeMillis();
    JobClient.runJob(jobA);
    JobConf jobB = new JobConf(getConf(), NgramModel.class);
    jobB.setJobName("evaluation");
    jobB.set("table", otherargs.get(1));
    jobB.setInt("order", order);
    jobB.setInt("type", type);
    jobB.setInt("cache", cache);
    jobB.setInputPath(tmp);
    jobB.setInputFormat(SequenceFileInputFormat.class);
    jobB.setOutputFormat(TextOutputFormat.class);
    jobB.setOutputKeyClass(Text.class);
    jobB.setOutputValueClass(FloatWritable.class);
    jobB.setNumReduceTasks(0);
    Path resultDir = new Path("prob");
    if (FileSystem.get(jobB).exists(resultDir))
      FileSystem.get(jobB).delete(resultDir);
    jobB.setOutputPath(resultDir);
    jobB.setMapperClass(NgramModel.class);
    long mytotal = 1;
    try {
      Path tmpFile = new Path("num-of-unigrams");
      FileSystem fileSys = FileSystem.get(jobB);
      SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, tmpFile,
          jobB);
      LongWritable value = new LongWritable();
      if (reader.next(new Text(), value)) {
        mytotal = value.get();
      }
      reader.close();
    } catch (IOException e) {
      System.out.println("can't open file num-of-unigrams, exit");
      return -1;
    }
    System.out.println("the total unigram number is: " + mytotal);
    jobB.setLong("total", mytotal);
    JobClient.runJob(jobB);
    long t2 = System.currentTimeMillis();
    JobConf conf = new JobConf(getConf(), NgramModel.class);
    conf.setJobName("test");
    Path tmpFile = new Path("perplexity");
    FileSystem fileSys = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, tmpFile,
        conf);
    Text k = new Text();
    FloatWritable v = new FloatWritable();
    while (reader.next(k, v)) {
      System.out.println(k.toString() + ": " + v.get());
    }
    long sec = (t2 - t1) / 1000; // seconds
    System.out.println("running time is: " + sec / 60 + " minutes " + sec
        % 60 + " seconds");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new NgramModel(), args);
    System.exit(res);
  }
}
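The class above is normally launched through the standard Hadoop job runner, passing the options parsed in run(). The snippet below is a minimal illustrative sketch, not part of the original program; the input directory test-data and the table name trigram-half are placeholder names. It shows how the same evaluation could be started programmatically, which is exactly what main() does with command-line arguments:

    // Illustrative only: evaluate a trigram model stored in the half ngram
    // table (-type 4) with the caching query enabled (-cache 1).
    // "test-data" and "trigram-half" are placeholder names.
    String[] jobArgs = { "-order", "3", "-type", "4", "-cache", "1",
        "test-data", "trigram-half" };
    int exitCode = ToolRunner.run(new Configuration(), new NgramModel(),
        jobArgs);
    System.exit(exitCode);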