Estimating Language Models Using Hadoop and Hbase

Xiaoyang Yu


Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2008

Abstract

This thesis presents the work of building a large scale distributed ngram language model using a MapReduce platform named Hadoop and a distributed database called Hbase. We propose a method focusing on the time cost and storage size of the model, exploring different Hbase table structures and compression approaches. The method is applied to build training and testing processes using the Hadoop MapReduce framework and Hbase. The experiments evaluate and compare different table structures when training unigram, bigram and trigram models on 100 million words, and the results suggest that a table based on the half ngram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large scale distributed language processing areas.


Acknowledgements

Many thanks to my supervisor Miles Osborne for his numerous pieces of advice and great support, and for inspiring new ideas about this project during our meetings. I would also like to thank my parents for their trust in me and their encouragement. Thanks a lot to Zhao Rui for her great suggestions for the thesis.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xiaoyang Yu)


Table of Contents

1 Introduction
2 Background
  2.1 Ngram Language Model
  2.2 Distributed Language Modeling
  2.3 Hadoop MapReduce Framework
  2.4 Hbase Database
3 Estimating Language Model using Map Reduce
  3.1 Generate Word Counts
  3.2 Generate Count of Counts
  3.3 Generate Good-Turing Smoothing Counts
  3.4 Generate Ngram Probability
  3.5 Generate Hbase Table
    3.5.1 n-Gram Based Structure
    3.5.2 Current Word Based Structure
    3.5.3 Context Based Structure
    3.5.4 Half Ngram Based Structure
    3.5.5 Integer Based Structure
  3.6 Direct Query
  3.7 Caching Query
4 Evaluation Methods
  4.1 Time and Space for Building LM
  4.2 LM Perplexity Comparison
5 Experiments
  5.1 Data
  5.2 Ngram Order
  5.3 Table Structures
6 Discussion
  6.1 Conclusion
  6.2 Future Works
Bibliography
A Source Code
  A.1 NgramCount.java
  A.2 TableGenerator.java
  A.3 NgramModel.java

Chapter 1
Introduction

In statistical natural language processing, the ngram language model is widely used in areas such as machine translation and speech recognition. An ngram model is trained on large sets of unlabelled text to estimate a probability model, and generally speaking, more data gives a better model. Although huge amounts of training text can easily be obtained from corpora or the web, the computational power and storage space of a single computer are both limited. Distributed language models have been proposed to meet the need of processing vast amounts of data and to address these problems.

In the field of distributed language model processing, most of the relevant work has concentrated on establishing methods for model training and testing, with little detailed discussion of the storage structure. The objective of our work is to efficiently construct and store the model in a distributed cluster environment. Moreover, we are interested in exploring how well different database table structures cope with language models.

The novel aspects of our project are as follows. We show that a distributed database can be well integrated with distributed language modeling, providing the input/output data sink for both the distributed training and testing processes. Using a distributed database helps to reduce the computational and storage cost. Meanwhile, the choice of database table structure strongly affects efficiency. We find that a half ngram based table structure is a good choice for distributed language models in terms of time and space cost, compared with the other structures in our experiments.

We use a distributed computing platform called Hadoop, which is an open source implementation of the MapReduce framework proposed by Google [3]. A distributed database named Hbase is used on top of Hadoop, providing a model similar to Google's Bigtable storage structure [2]. The training methods are based on Google's work on large language models in machine translation [1], and we propose a different method using Good-Turing smoothing with a back-off model and pruning. We store the ngram model in the distributed database with 5 different table structures, and a back-off estimation is performed in a testing process using MapReduce. We propose different table structures based on the ngram, the current word, the context, the half ngram and a converted integer representation. Using Hadoop and Hbase, we build unigram, bigram and trigram models on 100 million words from the British National Corpus. All the table structures are evaluated and compared for each ngram order with a testing set of 35,000 words. The training and testing processes are split into several steps, and for each step the time and space cost are compared in the experiments. The perplexity of the testing set is also compared with the traditional language modeling package SRILM [6]. The results are discussed with respect to the choice of a proper table structure for efficiency.

The rest of this thesis is organised as follows: Chapter 2 introduces the ngram language model, related work on distributed language modeling, the Hadoop MapReduce framework and the Hbase distributed database; Chapter 3 gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks, as well as all the different table structures we propose; Chapter 4 describes the evaluation methods we use; Chapter 5 shows all the experiments and results; Chapter 6 discusses the choice of table structure for Hadoop/Hbase and possible future work.

Chapter 2
Background

In statistical language processing, it is essential to have a language model which measures how likely a sequence of words is to occur in some context. The ngram language model is the major language modeling method, used along with smoothing and back-off methods to deal with the problem of data sparseness. Estimating such a model in a distributed environment enables us to process vast amounts of data and obtain a better language model. Hadoop is a distributed computing framework that can be used for language modeling, and Hbase is a distributed database which can store the model data as database tables and integrate with the Hadoop platform.

2.1 Ngram Language Model

The ngram model is the leading method for statistical language processing. It tries to predict the next word w_n given the n-1 word context w_1, w_2, ..., w_{n-1} by estimating the probability function:

P(w_n \mid w_1, \ldots, w_{n-1})    (2.1)

Usually a Markov assumption is applied: only the prior local context, the last n-1 words, affects the next word. The probability function can be expressed through the frequency of word occurrences in a corpus, using Maximum Likelihood Estimation without smoothing:

p(w_n \mid w_1, \ldots, w_{n-1}) = \frac{f(w_1, \ldots, w_n)}{f(w_1, \ldots, w_{n-1})}    (2.2)

where f(w_1, \ldots, w_n) is the count of how many times we see the word sequence w_1, \ldots, w_n in the corpus.

One important aspect is count smoothing, which adjusts the empirical counts collected from the training texts towards the expected counts for ngrams. Consider ngrams that do not appear in the training set but are seen in the testing text: a maximum likelihood estimate would give them zero probability, so we need a better estimate of the expected counts. A popular smoothing method called Good-Turing smoothing is based on the count of ngram counts; the expected counts are adjusted with the formula:

r^* = (r + 1) \frac{N_{r+1}}{N_r}    (2.3)

where r is the actual count and N_r is the number of ngrams that occur exactly r times.

For words that are seen frequently, r can be a large number and N_{r+1} is likely to be zero; in this case we can look up N_{r+2}, N_{r+3}, ..., N_{r+n}, find a nonzero count and use that value instead. The idea of count smoothing is to better estimate the probabilities of ngrams. Meanwhile, for unseen ngrams we can do a back-off estimation using the probability of the lower order tokens, which is usually more robust in the model. For an ngram w_1, \ldots, w_n that does not appear in the training set, we estimate the probability of the (n-1)-gram w_2, \ldots, w_n instead:

P(w_n \mid w_1, \ldots, w_{n-1}) =
  \begin{cases}
    p(w_n \mid w_1, \ldots, w_{n-1}) & \text{if } (w_1, \ldots, w_n) \text{ is found} \\
    \lambda(w_1, \ldots, w_{n-1}) \cdot p(w_n \mid w_2, \ldots, w_{n-1}) & \text{otherwise}
  \end{cases}    (2.4)

where \lambda(w_1, \ldots, w_{n-1}) is the back-off weight. In general, back-off requires more lookups and computation. Modern back-off smoothing techniques like Kneser-Ney smoothing [5] use more parameters to estimate each ngram probability instead of a simple Maximum Likelihood Estimation.

To evaluate a language model, we can calculate the cross entropy and perplexity of the testing ngrams. A good language model should give a higher probability to each ngram prediction. The cross entropy is the average entropy of each word prediction in the ngram:

H(p_{LM}) = -\frac{1}{n} \log p_{LM}(w_1, w_2, \ldots, w_n)
          = -\frac{1}{n} \sum_{i=1}^{n} \log p_{LM}(w_i \mid w_1, \ldots, w_{i-1})    (2.5)

Perplexity is defined based on the cross entropy:

PP = 2^{H(p_{LM})} = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log p_{LM}(w_i \mid w_1, \ldots, w_{i-1})}    (2.6)


Generally speaking, perplexity is the average number of choices available at each word prediction, so the lower the perplexity, the higher the probability of the prediction and the better the language model.
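To make equations (2.5) and (2.6) concrete, here is a small, self-contained Java sketch that computes cross entropy and perplexity from a list of per-word conditional probabilities produced by some language model. It is an illustration only; the class name and the toy probabilities are not from the thesis code.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative helper for equations (2.5) and (2.6); not part of the thesis source code.
    public class Perplexity {

        // probs holds p_LM(w_i | w_1, ..., w_{i-1}) for each word of the test text.
        public static double crossEntropy(List<Double> probs) {
            double sum = 0.0;
            for (double p : probs) {
                sum += Math.log(p) / Math.log(2.0);   // log base 2
            }
            return -sum / probs.size();
        }

        public static double perplexity(List<Double> probs) {
            return Math.pow(2.0, crossEntropy(probs));
        }

        public static void main(String[] args) {
            // Toy conditional probabilities for a four-word test sequence.
            List<Double> probs = Arrays.asList(0.2, 0.05, 0.1, 0.25);
            System.out.printf("cross entropy = %.3f bits%n", crossEntropy(probs));
            System.out.printf("perplexity    = %.3f%n", perplexity(probs));
        }
    }

A lower perplexity value from such a computation indicates that the model assigned higher probabilities to the words that actually occurred.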

2.2 Distributed Language Modeling

Large scale distributed language modeling is quite a new topic. Typically using a server/client architecture as in Figure 2.1, we need to manipulate large data sets efficiently, communicate with the cluster workers and organise their work. The server controls and distributes tasks to all the workers, and clients send requests to the server to execute queries or commands. Related ideas can be seen in Distributed Information Retrieval [7]. Recent work includes a method that splits a large corpus into many non-overlapping chunks and makes each worker in the cluster load one of the chunks together with its suffix array index [8]. The suffix array index helps to find the required context quickly, and word occurrences can be counted in each worker simultaneously. In this kind of approach, the raw word counts are stored and served in the distributed system, and the clients collect the needed counts and then compute the probabilities. The system is applied to the N-best list re-ranking problem and shows a nice improvement on a 2.97 billion-word corpus.

Figure 2.1: Server/Client Architecture

A similar architecture was proposed later with interpolated models [4]. The corpus is again split into chunks along with their suffix array indexes, and in addition a smoothed n-gram model is computed and stored separately on some other workers. The client then requests both the raw word counts and the smoothed probabilities from different workers and computes the final probability by linear weighted blending. The authors apply this approach to N-best list re-ranking and integrate it with a machine translation decoder, showing a good improvement in translation quality when trained on 4 billion words.

The previous two methods solve the problem of storing a large corpus and providing word counts for clients to estimate probabilities. Later work [1] describes a method that stores only the smoothed probabilities for distributed n-gram models. In the previous methods, the client needs to look up every worker to find the proper context using the suffix array index; with this method, exactly one worker is contacted per n-gram, and exactly two workers for context-dependent backoff [1]. The authors propose a new backoff method called Stupid Backoff, a simpler scheme using only the frequencies and an empirically chosen back-off weight:

P(w_n \mid w_1, \ldots, w_{n-1}) =
  \begin{cases}
    \frac{f(w_1, \ldots, w_n)}{f(w_1, \ldots, w_{n-1})} & \text{if } f(w_1, \ldots, w_n) > 0 \\
    \alpha \cdot \frac{f(w_2, \ldots, w_n)}{f(w_2, \ldots, w_{n-1})} & \text{otherwise}
  \end{cases}    (2.7)

where α is set to 0.4 based on their earlier experiments [1]. The experiments directly store 5-grams model using different sources. According to their results, 1.8T tokens from web include a 16M vocabulary, and generate a 300G n-grams with a 1.8T language model with their Stupid Backoff. These previous works are the theoretical principles of our project. The last work used Google’s distributed programming model MapReduce [3], but the authors didn’t describe the way to store the data. We adapt the MapReduce model for our work, because it contains a clear workflow and has already been proved as a mature model and widely used in various applications by Google, Yahoo and other companies. Although Google’s own implementation of MapReduce is proprietary, we can still choose open source implementation of this model.
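The following short Java sketch is our own illustration of equation (2.7), not code from [1] or from this thesis. It scores an ngram, represented as a space-joined string, against a map of raw frequencies, and it applies the back-off step recursively down to the unigram level, which generalises the single back-off step written in (2.7). The "<total>" key for the overall token count is an invented convention.

    import java.util.Arrays;
    import java.util.Map;

    // Illustrative implementation of Stupid Backoff; names and representation are our own.
    public class StupidBackoff {

        private static final double ALPHA = 0.4;
        private final Map<String, Long> counts;   // space-joined ngram -> raw frequency

        public StupidBackoff(Map<String, Long> counts) {
            this.counts = counts;
        }

        // Relative frequency of words[from..end], backing off by dropping the
        // left-most context word and multiplying by ALPHA when the ngram is unseen.
        public double score(String[] words, int from) {
            if (from == words.length - 1) {
                // Unigram case: f(w_n) divided by the total number of tokens.
                long total = counts.getOrDefault("<total>", 1L);
                return counts.getOrDefault(words[from], 0L) / (double) total;
            }
            String full = String.join(" ", Arrays.copyOfRange(words, from, words.length));
            String context = String.join(" ", Arrays.copyOfRange(words, from, words.length - 1));
            Long fFull = counts.get(full);
            Long fContext = counts.get(context);
            if (fFull != null && fFull > 0 && fContext != null && fContext > 0) {
                return fFull / (double) fContext;
            }
            return ALPHA * score(words, from + 1);   // back off to the shorter context
        }
    }

For example, score(new String[]{"the", "quick", "fox"}, 0) returns f("the quick fox") / f("the quick") when the trigram was seen, and 0.4 times the bigram score otherwise.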

2.3 Hadoop MapReduce Framework

Hadoop is an open source implementation of the MapReduce programming model. It is based on Java and uses the Hadoop Distributed File System (HDFS) to create multiple replicas of data blocks for reliability, distributing them around the cluster and splitting each task into small blocks. According to its website, Hadoop has been demonstrated on 2,000-node clusters and is designed to support clusters of up to 10,000 nodes, so it allows us to extend our cluster in the future. A general MapReduce architecture (see http://wiki.apache.org/hadoop/HadoopMapReduce) is illustrated in Figure 2.2. First the input files are split into small blocks named FileSplits, and the Map operation is parallelised with one task per FileSplit.

Figure 2.2: MapReduce Architecture

The input and output types of a MapReduce job are:

(input) → map → combine → reduce → (output)

Each FileSplit is treated as key/value pairs, and the user specifies a Map function to process a key/value pair and generate a set of intermediate key/value pairs [3]. When the Map operation is finished, the output is passed to a Partitioner, usually a hash function, so that all pairs sharing the same key can later be collected together. After the intermediate pairs are generated, a Combine function is called to do a reduce-like job on each Map node to speed up processing. A Reduce function then merges all intermediate values associated with the same intermediate key and writes them to output files. Map and Reduce operations work independently on small blocks of data. The final output is one file per executed reduce task, and Hadoop stores the output files in HDFS.

For input text files, each line is parsed as one "value" string, so the Map function works at the sentence level; for output files, the format is one key/value pair per record, so if we want to reprocess the output files, the task works at the record pair level. For language model training using MapReduce, our original inputs are text files, and what we finally get from Hadoop are ngram/probability pairs. This means that, in theory, we could build the language model using Hadoop alone. This chapter gives a brief overview of ngram language models, smoothing and back-off methods, relevant work on distributed language modeling, the Hadoop framework and the Hbase database.
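As an illustration of this key/value workflow, the sketch below shows the shape of a simple word-count style job in Java. It uses the org.apache.hadoop.mapreduce API of more recent Hadoop releases rather than the API available in 2008, and it is a simplified stand-in for illustration, not the NgramCount.java code listed in the appendix.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Simplified sketch of a unigram count job; not the thesis's NgramCount.java.
    public class WordCountSketch {

        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Each input line arrives as one "value"; emit (word, 1) pairs.
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // All values for the same key are merged here; the same class can
                // also serve as the combiner run locally on each Map node.
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

Extending this to ngram counting mainly changes the Map function, which would emit one key per ngram seen in the line instead of one key per word.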

2.4 Hbase Database

Hbase is a distributed database on top of HDFS whose structure is very similar to Google's Bigtable model. The reason we look into using Hbase for the language model is that, although we could do all of our work in Hadoop and store all outputs as text or binary files in HDFS, this would consume more time and extra computation for parsing input records. The bottleneck of Hadoop MapReduce is that its input/output is file based, either plain text or binary, which is reasonable for storing large amounts of data but not suitable for the query process. For example, if we want to query the probability of one ngram, we have to load all the files into the map function, parse all of them, and compare each key with the ngram to find the probability value. We basically need to do this comparison for every ngram in the test texts, which takes quite a long time.

Instead of parsing files, we can make use of a database structure such as Hbase to store the ngram probabilities in database tables. The advantages are obvious: a database structure is designed to meet the needs of repeated queries; the data is indexed and compressed, reducing the storage size; and tables can be easily created, modified, updated or deleted. In a distributed database, the table is stored in a distributed filesystem, providing scalable and reliable storage. Meanwhile, for language modeling, the model data is highly structured. The basic format is the ngram/probability pair, which can easily be organised into more compact structures. Considering that we may get a huge amount of data, such compressed structures are essential from both the time and storage aspects.

Hbase stores data in labelled tables. The table is designed to have a sparse structure: data is stored in table rows, and each row has a unique key with an arbitrary number of columns, so one row may have thousands of columns while another row has only one. In addition, a column name is a string of the form <family>:<label>.
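To show what storing and querying one ngram/probability pair looks like in practice, here is a brief sketch using a recent HBase Java client API, not the client API that was current in 2008. The table name, column family and qualifier are invented for illustration, the table is assumed to already exist, and this is not the thesis's table design, which is discussed in Chapter 3.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Minimal sketch of writing and reading one ngram probability; names are illustrative.
    public class HbaseNgramSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("ngram_lm"))) {

                // Row key: the ngram itself; column family "prob", qualifier "p".
                Put put = new Put(Bytes.toBytes("the quick fox"));
                put.addColumn(Bytes.toBytes("prob"), Bytes.toBytes("p"), Bytes.toBytes(0.0017d));
                table.put(put);

                // Query the same row key and read back the stored probability.
                Result result = table.get(new Get(Bytes.toBytes("the quick fox")));
                byte[] raw = result.getValue(Bytes.toBytes("prob"), Bytes.toBytes("p"));
                if (raw != null) {
                    System.out.println("P = " + Bytes.toDouble(raw));
                }
            }
        }
    }

The key design decision, which the later chapters explore, is what goes into the row key and what goes into the columns, since that determines both the table size and the number of lookups needed per query.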
