Hidden Tree Markov Models for Document Image Classification

Michelangelo Diligenti∗    Paolo Frasconi†    Marco Gori∗

∗ Dept. of Information Engineering, Università di Siena, Italy
† Dept. of Systems and Computer Science, Università di Firenze, Italy

Abstract

Classification is an important problem in image document processing and is often a preliminary step towards recognition, understanding, and information extraction. In this paper, the problem is formulated in the framework of concept learning and each category corresponds to the set of image documents with similar physical structure. We propose a solution based on two algorithmic ideas. First, we obtain a structured representation of images based on labeled XY-trees (this representation informs the learner about important relationships between image sub-constituents). Second, we propose a probabilistic architecture that extends hidden Markov models for learning probability distributions defined on spaces of labeled trees. Finally, a successful application of this method to the categorization of commercial invoices is presented.

Keywords: Document classification, machine learning, Markovian models, structured information.

1 Introduction

In spite of the explosive increase of electronic communication in recent years, a significant fraction of documents is still printed on paper. Image document processing aims at automating operations normally carried out by individuals on printed documents, including reading, understanding, extracting information, and organizing documents according to given criteria. Operations such

as information extraction can be greatly simplified if some background knowledge about the physical layout is available. Layout classification is therefore an important preliminary step for guiding the subsequent recognition task [6]. One important application is the automatic management of commercial invoice forms. Here a classifier can be employed to determine which company issued the invoice and/or which type of form was employed. Besides other obvious applications (e.g. organizing invoices in separate folders), classification can also help the construction of information extraction modules, since the presence, the position, and the type of the various pieces of information vary significantly with the issuing company and with the specific form that was employed. Digital libraries are also a target application domain. In this case, layout classification can partition scanned pages according to useful criteria, such as the presence of titles, advertisements, or table-of-contents pages. Again, data organization and automatic metadata extraction can greatly benefit from knowing the class of the document under analysis.

In this paper we formulate layout classification as a multiclass classification problem. The choice of document representation is very important for this learning problem. Flat representations (e.g., pixel based) do not carry robust information about the position and the number of basic constituents of the image (such as logos, headers, and text fields). More appropriate representations in this domain involve either the extraction of global features or the use of recursive data structures. When global features are extracted, vector-based machine learning algorithms can be employed for classification. For example, the methods in [19, 20] collect several statistics from the bitmap image and subsequently use oblique decision trees or self-organizing maps for classification. On the other hand, the use of recursive representations makes it possible to preserve relationships among the image constituents. For example, in [6, 7] documents are represented as structured patterns, while in [3] both the physical and the logical structure are used to recognize documents.

The structured representation suggested in this paper is obtained by extracting an XY-tree from the document image [17]. Nodes are associated with specific boxes of the document and edges represent a containment relation. For each box we run several feature extractors, in order to get a real-valued vector of attributes. The


resulting representation aims at capturing local features of constituents, along with structural relationships among constituents, and is therefore less sensitive to interclass variability.

A second important practical aspect in document classification is that many systems must deal with a relatively large number of classes, and classes normally vary over time. Discriminant approaches (including multilayer neural networks and support vector machines) can achieve very good accuracy on many learning tasks. However, if the set of classes changes, the classifier must be retrained from scratch. Generative probabilistic models avoid this problem by learning a class-conditional distribution P(Y | C = c) for each class c. In this way, given an arbitrary set of classes C (the realizations the class variable C can take on) and a generative model for each class c ∈ C, we can obtain h(Y) = arg max_{c ∈ C} P(Y | C = c) P(C = c).

Based on the above considerations, we suggest a classification method that (1) uses structured data representations and (2) is based on a generative model. The combination of these two features requires a new machine learning tool, since generative models are abundant in the literature for both attribute-value and sequential representations, but not for hierarchical data structures. On the other hand, compositional approaches to pattern recognition have mainly been proposed in conjunction with early structural grammars [10] (lacking sound methods for training) or discriminant models, such as recursive neural networks [9].

The generative classifier we propose in this paper is based on a generalization of hidden Markov models (HMMs), a powerful tool for probabilistic sequence modeling. The recent view of the HMM as a particular case of Bayesian networks [21, 2, 15] has provided new theoretical insights and helped in conceiving extensions of the standard model in a sound and formally elegant framework. Whereas standard HMMs are commonly employed for learning in sequential domains, our extension can learn probability distributions on labeled trees. We call the novel architecture the Hidden Tree Markov Model (HTMM); the model should not be confused with Hidden Markov Decision Trees [14], a graphical model for learning in sequential domains using a tree-structured architecture. This extension was first suggested in [9] as one of the possible architectures for connectionist learning of data structures. In this paper, we develop the architectural details not included in [9] and present a solution to a real commercial invoice classification problem.

2 Hidden tree-Markov models

We now develop a general theoretical framework for modeling probability distributions defined over spaces of trees. In the following we assume that each instance, denoted Y, is a labeled m-ary tree. Each node, v, is labeled by a K-tuple of (discrete or continuous) random variables, denoted Y(v).

2.1 Model definition

Compositionality in a generative stochastic model can be achieved using probabilistic independence. For each tree Y, we can write P(Y) = P(Y(r), Y^1, ..., Y^m), where Y(r) is the label at the root node r and the Y^i are random sub-trees. We say that the first-order tree-Markov property is satisfied for Y if

    P(Y^i | Y^j, Y(r)) = P(Y^i | Y(r))   for all i, j = 1, ..., m.        (1)

From this property one can prove by induction that

    P(Y) = P(Y(r)) ∏_{v ∈ V \ {r}} P(Y(v) | Y(pa[v])),                    (2)

where pa[v] denotes the parent of vertex v and V denotes the set of vertices of Y. Unfortunately, such a compositional formulation of the model has two main disadvantages. First, the conditional independence assumption (1) is very unrealistic and does not allow the model to capture correlations that might exist between any two non-adjacent nodes. Second, the parameterization of P(Y(v) | Y(pa[v])) might be problematic, since the number of parameters grows exponentially with the number of attributes in each label. As in HMMs for sequences, we thus introduce hidden state variables in charge of "storing" and "summarizing" distant information. More precisely, let X = {x1, ..., xn} be a finite set of states. We assume that each tree Y is generated by an underlying hidden tree X, a data structure defined as follows. The skeleton of X is identical to the skeleton of Y. Nodes of X are labeled by hidden state variables,


denoted X(v), taking realizations on X. In this way, P(Y) = Σ_X P(Y, X), where the sum over X indicates marginalization over all the hidden trees. The first-order tree-Markov property in the case of hidden tree models holds if the following two conditions are met: (1) the first-order tree-Markov property holds for the hidden tree X, and (2) for every v, the observation Y(v) is independent of the rest given X(v). These conditions imply the following global factorization formula:

    P(Y, X) = P(X(r)) P(Y(r) | X(r)) ∏_{v ∈ V \ {r}} P(Y(v) | X(v)) P(X(v) | X(pa[v]))        (3)

which can be graphically represented by the belief network shown in Figure 1. The associated stochastic model will be called the hidden tree-Markov model (HTMM).

Figure 1: Belief network for a hidden tree-Markov model. White nodes belong to the underlying hidden tree X. Shaded nodes are observations.
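To make the factorization concrete, the following minimal sketch (not from the paper) evaluates Eq. (3) for a toy tree whose hidden states are assumed known. The Node container and the prior/trans/emit tables are illustrative names for P(X(r)), P(X(v)|X(pa[v])), and P(Y(v)|X(v)), here shared across nodes as in the fully stationary parameterization introduced in the next subsection.

```python
# Illustrative sketch: evaluating the joint of Eq. (3) for a tiny tree whose
# hidden states are assumed known, under a fully stationary parameterization.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    y: int                                  # observed (discretized) label
    x: int                                  # hidden state (known here for illustration)
    children: List["Node"] = field(default_factory=list)

def joint_probability(root, prior, trans, emit):
    """P(Y, X) = P(X(r)) P(Y(r)|X(r)) * prod_v P(Y(v)|X(v)) P(X(v)|X(pa[v]))."""
    p = prior[root.x] * emit[root.x][root.y]
    stack = [root]
    while stack:
        v = stack.pop()
        for c in v.children:
            p *= trans[v.x][c.x] * emit[c.x][c.y]
            stack.append(c)
    return p

# Toy example: 2 hidden states, 2 observable symbols, a root with two children.
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]            # trans[parent_state][child_state]
emit  = [[0.9, 0.1], [0.3, 0.7]]            # emit[state][symbol]
tree = Node(y=0, x=0, children=[Node(y=1, x=1), Node(y=0, x=0)])
print(joint_probability(tree, prior, trans, emit))
```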

2.2 Parameterization

According to Eq. (3), the parameters of an HTMM are P(X(r)), a prior on the root hidden state; P(Y(v)|X(v)), the emission parameters; and P(X(v)|X(pa[v])), the transition parameters. Unless sharing mechanisms are defined, these parameters vary with the node being considered, yielding a large total number of parameters that may quickly lead to overfitting problems. In HTMMs, several forms of stationarity can be assumed, each associated with a different form of parameter sharing. We say that an HTMM is fully stationary if it is both transition and emission stationary, i.e. neither P(X(v)|X(pa[v])) nor P(Y(v)|X(v)) depends on v. In this case the total number of parameters is n(1 + n + K), where K is the number of parameters in each emission model P(Y|X = x). We say that the model is locally stationary if it is emission stationary and P(X(v)|X(pa[v])) = P(X(w)|X(pa[w])) whenever v and w are both the i-th child of their parent in the tree. This assumption yields n(1 + nm + K) parameters. Finally, we say the model is level stationary if P(X(v)|X(pa[v])) = P(X(w)|X(pa[w])) and P(Y(v)|X(v)) = P(Y(w)|X(w)) whenever v and w belong to the same level of the tree. In this case the number of parameters is n + Ln(n + K), where L is the maximum allowed height of a tree.
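The three counts above can be checked with a few lines of arithmetic. The sketch below simply encodes the formulas; the numeric values at the bottom are illustrative, not the settings used in the experiments.

```python
# Parameter counts for the three sharing schemes described above.
# n = number of hidden states, m = maximum out-degree, K = parameters per
# emission model P(Y | X = x), L = maximum allowed tree height.

def fully_stationary(n, K):
    return n * (1 + n + K)          # prior + one transition table + emissions

def locally_stationary(n, m, K):
    return n * (1 + n * m + K)      # one transition table per child position

def level_stationary(n, K, L):
    return n + L * n * (n + K)      # prior + per-level transitions and emissions

n, m, K, L = 11, 3, 60, 8           # illustrative values only
print(fully_stationary(n, K), locally_stationary(n, m, K), level_stationary(n, K, L))
```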

2.3 Inference and Learning

Since the model is a special case of Bayesian network, the two main algorithms (inference and parameter estimation) can be derived as special cases of the corresponding algorithms for Bayesian networks. Inference consists of computing all the conditional probabilities of the hidden states, given the evidence entered into the observation nodes (i.e. the labels of the tree). Since the network is singly connected, the inference problem can be solved either by π-λ propagation [18] or by the more general junction tree algorithm [12]. In our experiments we implemented a specialized version of the junction tree algorithm. Given the special structure of the HTMM network (see Figure 1), no moralization is required and the cliques forming the junction tree are of two kinds: one containing X(v) and X(pa[v]), the other containing Y(v) and X(v). In both cases, only two variables are contained in a clique. It should be remarked that P(Y) is simply obtained as the normalization factor in any clique after propagation in the junction tree. Inference in HTMMs is very efficient and runs in time proportional to the number of nodes in Y and to the number of states in X.

Each model λi is trained using examples belonging to class ci. Learning is formulated in the maximum likelihood framework, and the presence of hidden variables requires an iterative algorithm for solving the optimization problem. In our implementation we have chosen Expectation-Maximization (EM), a general family of algorithms for maximum likelihood estimation in problems with missing data [5]. In the case of Bayesian networks (and thus also in the case of HTMMs), it can be shown that the E-step reduces to computing expected sufficient statistics for the parameters (i.e., expected transition and emission counts), and the M-step simply consists of updating the parameters with the normalized expected sufficient statistics (see, e.g., [11] for details).

Figure 2: An invoice instance for each issuing company is shown. Invoices are highly structured patterns, composed of both graphical and textual fields.
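As a concrete illustration of the quantity P(Y) used above, the sketch below computes it with a simple upward (leaves-to-root) pass, which for a singly connected model yields the same value as the junction tree propagation described in the text. This is not the paper's implementation; it assumes the fully stationary, discrete-emission parameterization and reuses the illustrative Node/prior/trans/emit containers from the earlier sketch.

```python
# Minimal sketch: likelihood of a labeled tree under a fully stationary HTMM,
# computed by an upward recursion (not the paper's junction-tree code, but it
# returns the same P(Y) for this singly connected model).

def upward(node, trans, emit, n_states):
    """beta[x] = P(observed labels in the subtree rooted at `node` | X(node) = x)."""
    beta = [emit[x][node.y] for x in range(n_states)]
    for child in node.children:
        child_beta = upward(child, trans, emit, n_states)
        for x in range(n_states):
            beta[x] *= sum(trans[x][xc] * child_beta[xc] for xc in range(n_states))
    return beta

def likelihood(root, prior, trans, emit):
    """P(Y) = sum_x P(X(r) = x) * beta[x] at the root."""
    n_states = len(prior)
    beta = upward(root, trans, emit, n_states)
    return sum(prior[x] * beta[x] for x in range(n_states))
```

In a full EM implementation the same recursion would be complemented by a downward pass in order to obtain the expected transition and emission counts used in the M-step.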

3 Dataset preparation

3.1 Dataset

We chose commercial invoices as a test bed for document classification. We collected 889 commercial invoices, issued by 9 different companies. The class of an invoice corresponds to the issuing company. The number of available instances per class is rather unbalanced, ranging between 46 and 191. Some instances are shown in Figure 2. Since the dataset contains classified information, the images in Figure 2 have been edited and critical fields (such as logos, addresses, and other elements that could lead to the identification of the involved companies) have been covered by black strips. It can be seen that invoices are complex patterns, composed of many semantic fields separated either by lines or white spaces. Although the selected invoices contain both graphical and textual information, we made no attempt to extract text with optical character recognition systems. The purpose of our experiments is to demonstrate that document images can be discriminated using layout information only.

3.2 XY-tree extraction

The XY-tree representation is a well-known approach for describing the physical layout of documents that preserves relationships among individual constituents [17, 16]. The root of an XY-tree is associated with the whole document image. The document is then recursively split into regions that are separated by cut lines. Horizontal and vertical white spaces are used alternately for cutting, so regions at even levels in the tree are cut along the X axis while regions at odd levels are cut along the Y axis. Since the classic space-based version of the algorithm cannot properly handle lines across the formatted area, our implementation allows both space and line cuts [4]. Line-based splits are rejected if the cut line is shorter than a given fraction of the region size. Moreover, lines always have priority over spaces. A region is not further divided if neither a horizontal nor a vertical white space can be found, or if the area of the region is smaller than a given resolution. In our experiments this resolution was set to 1% of the total document area, yielding an average tree size of 28 nodes on our dataset. Children are ordered left-to-right for vertical cuts and top-to-bottom for horizontal cuts. Since the XY-tree extraction can be strongly affected by salt-and-pepper noise, document images were preprocessed in order to remove small connected components.
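The recursive cutting procedure can be sketched as follows. This is a simplified, space-cut-only version (it omits the line cuts, priority rules, and noise filtering described above), and the function name, thresholds, and binary-image input format are assumptions for illustration.

```python
import numpy as np

def xy_cut(img, box, horizontal=True, min_area_frac=0.01, total_area=None):
    """Simplified recursive XY-cut on a binary image (1 = ink).
    `box` = (x0, y0, x1, y1); cut direction alternates at each level."""
    x0, y0, x1, y1 = box
    if total_area is None:
        total_area = img.shape[0] * img.shape[1]
    node = {"box": box, "children": []}
    if (x1 - x0) * (y1 - y0) < min_area_frac * total_area:
        return node                                  # region too small: leaf
    region = img[y0:y1, x0:x1]
    # Ink profile along the cut direction: a row/column is blank if it has no ink.
    profile = region.sum(axis=1 if horizontal else 0)
    blank = profile == 0
    # Maximal runs of non-blank positions -> sub-regions between white-space gaps.
    runs, start = [], None
    for i, b in enumerate(blank):
        if not b and start is None:
            start = i
        elif b and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(blank)))
    if len(runs) <= 1:
        return node                                  # no white-space gap: leaf
    for s, e in runs:
        child_box = (x0, y0 + s, x1, y0 + e) if horizontal else (x0 + s, y0, x0 + e, y1)
        node["children"].append(
            xy_cut(img, child_box, not horizontal, min_area_frac, total_area))
    return node
```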

3.3 Local feature extraction

The shape of the XY-tree reflects the document layout, but it may not be discriminant enough for classification. A more informative representation can be obtained by labeling each node with a set of features (or attributes) extracted from the region associated with the node. For each region of the document we used a vector of six features:

1. a flag indicating whether the parent was cut using a line or a space;
2. the four coordinates of the bounding box (in normalized document space coordinates);
3. the average grey level of the region.

Real-valued variables were quantized by running the maximum entropy discretization algorithm described in [8], obtaining discrete attributes with 10 possible realizations each. Note that since the discretization algorithm collects statistics over the available data, it must be run on the training data only.
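A node label of the kind described above can be assembled as in the short sketch below. The function name, the normalization, and the grey-level computation are illustrative, and the maximum-entropy discretization of [8] is applied afterwards and is not reproduced here.

```python
import numpy as np

def region_features(img, box, page_w, page_h, cut_by_line):
    """Six-feature label for an XY-tree node: line/space-cut flag, normalized
    bounding box, average grey level. Real attributes are discretized later."""
    x0, y0, x1, y1 = box
    grey = float(img[y0:y1, x0:x1].mean()) if (y1 > y0 and x1 > x0) else 0.0
    return [
        1.0 if cut_by_line else 0.0,      # 1) parent cut by a line or a space
        x0 / page_w, y0 / page_h,         # 2) bounding box in normalized
        x1 / page_w, y1 / page_h,         #    document-space coordinates
        grey,                             # 3) average grey level of the region
    ]
```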

4 Implementation and results

We compared HTMMs to the document decision tree (DDT) algorithm described in [1]. DDT is an instance-based learner that extends the hierarchical classifier proposed by Dengel [6, 7]. A DDT stores the training documents (represented as XY-trees) into a tree data structure built to perform efficient classification. Each leaf of a DDT contains a training example, whereas each internal node contains the XY-subtrees shared by all the training examples dominated by that node. As in traditional decision trees, classification is carried out by descending the tree top-down and performing tests at each internal node. Tests in this case consist of computing similarities between the input XY-tree and the XY-subtrees contained in the children of the internal node. Once a leaf is reached, the learner outputs the class of the training example stored in the leaf.

We used one HTMM, λk, for each class, and each HTMM had 11 states x1, ..., x11. A document was classified as belonging to class k if P(Y|λk) > P(Y|λj) for all j ≠ k. Transitions were restricted by a left-to-right topology, i.e. we forced P(X(v) = xi | X(pa[v]) = xj) = 0 if |i − j| > d. In our experiments we set d = 2. This helps to reduce the number of parameters to be estimated. Moreover, we used a fully stationary model (see Section 2) as a form of parameter sharing.

One practical difficulty in estimating classification accuracy is that datasets in the domain of commercial invoices are typically small. Moreover, we are interested in evaluating how accuracy decreases with the number of available training examples. In the case of small datasets, well-known approaches for obtaining good estimates of classification accuracy are leave-one-out and k-fold cross validation. In our experiments, in order to take advantage of all the available instances and, at the same time, to evaluate accuracy as a function of the training set size, we opted for a modified version of the cross validation

algorithm. The dataset was divided into M groups of instances and, for each N < M, we trained the classifier on N groups and tested generalization on the remaining M − N groups. These operations were iterated with N varying between 1 and M − 1, and each time all the M!/(N!(M − N)!) combinations of groups were used to carry out an experiment on a new training and test set. In this way the total number of single training/test experiments is 2^M − 2.
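The experimental protocol just described can be sketched as follows. Here `train_htmm` (assumed to return a prior/transition/emission triple) is a stand-in for the EM training of Section 2.3, `likelihood` is as in the earlier inference sketch, and the decision rule is the per-class likelihood comparison stated at the beginning of this section.

```python
from itertools import combinations

def modified_cross_validation(groups, classes, train_htmm, likelihood):
    """groups: list of M lists of (tree, class_label) pairs.
    For every N in 1..M-1 and every choice of N groups as training set,
    train one HTMM per class and measure accuracy on the remaining groups."""
    M = len(groups)
    results = []                                     # (N, accuracy) pairs
    for N in range(1, M):
        for train_idx in combinations(range(M), N):
            train = [ex for i in train_idx for ex in groups[i]]
            test = [ex for i in range(M) if i not in train_idx for ex in groups[i]]
            # One generative model per class, trained on that class only.
            models = {c: train_htmm([t for t, y in train if y == c]) for c in classes}
            correct = sum(
                1 for t, y in test
                if max(classes, key=lambda c: likelihood(t, *models[c])) == y)
            results.append((N, correct / len(test)))
    return results                                   # 2**M - 2 experiments in total
```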

[Figure 3 appears here: classification accuracy plotted against the fraction of examples used for training, comparing HTMM and DDT, for the 7-fold (a) and 10-fold (b) experiments.]

Figure 3: Results of the cross-validation technique when the available examples are divided into 7 and 10 groups ((a) and (b), respectively).

Two experimental sessions were performed. In the first one we set M = 7, while in the second one we set M = 10. The results of the first session are shown in Figure 3a and the results of the second in Figure 3b. HTMMs perform very well on the classification task: using 800 examples (about 90 examples per class) as a training set, the recognition rate was as high as 99.28% (a 34% relative error reduction compared to the DDT algorithm). Accuracy does not decrease rapidly when a smaller number of examples is used to train the models. The 10-fold results show that when only 20% of the examples are used for training (and the remaining 80% are included in the test set), the accuracy is 95%. DDTs do not perform as well as HTMMs when the training set is large enough to cover the statistical variability of the patterns: if 800 examples are inserted into the DDT, the correct classification rate is 98.26%. On the other hand, when the training set is small, the classification capabilities of DDTs outperform those of HTMMs. DDTs can perform well even if the training set is small because they can focus their attention on small portions of the input patterns

without trying to model the whole pattern. HTMMs, on the other hand, take into account the entire input pattern. This turns out to be advantageous if enough training data is provided, but may lead to overfitting if the training set is too small.

Figure 4: State activation maps for a trained HTMM (states 1–11). Maps are computed on the document image shown in the upper-left corner.

4.1 Analysis of hidden states

In the case of sequence modeling, HMM states often have an explicit interpretation in terms of temporal events, and the Viterbi algorithm can be used to "decode" these events by computing the most likely state sequence given the observed data. In HTMMs, a similar decoding algorithm can easily be obtained by running the max-calibration procedure [13] after inserting evidence

on the observed variables Y(v). This procedure yields the most likely state configuration given the observed XY-tree. By observing the decoded state trees, we noted that several states tend to activate on specific types of image regions. Thus max-calibration can be helpful for performing image segmentation. In Figure 4 we show the state activation maps for a particular invoice image. The activation map for state xi is a two-dimensional image, superimposed on the document image, that reveals on which regions state xi is active. In this context a region R is a rectangular box contained in an XY-tree leaf u. By definition of the XY-tree, R is dominated by all the nodes on the path from the root to R. Maps were then obtained as follows. First, we ran the max-calibration algorithm, obtaining for each node v the most likely state x*(v). Then, for each state xi and for each region R, the activity of xi in R was defined as

    a(xi, R) = |{v : x*(v) = xi and v dominates R}|.

By inspecting Figure 4 we note how several states specialize to activate on specific invoice fields. For example, state 9 is mostly active on the region containing the customer address, state 2 on the invoice header (logo and information about the issuing company), states 10 and 11 on the purchases list, state 3 on some bottom boxes containing subtotals, and so on. So, besides classification, HTMMs can be used for image segmentation and automatic region labeling. Region labeling may be useful to guide subsequent information extraction algorithms.

In Figure 5 we show a set of activation maps averaged over all the test-set invoices of a given class. Again, it can be noted how certain states specialize on specific invoice fields. In this example we see that state 2 corresponds to the purchases list. The state activation decreases while descending the sub-region of its competence, reflecting the fact that the probability of having purchased at least n items decreases with n, whereas at least one item is always bought. The maps also provide interesting insights about how classification is performed. Arrows in the diagram indicate the highest state transition probabilities after learning. These transitions can be interpreted as a sort of grammar indicating how a valid image should be parsed to show that it actually belongs to the class being modeled.

Figure 5: Averaged state activation maps (states 1–11). The document image shown on the left is one instance of the class associated with the model. Activation maps are averaged over the test set. Arrows between states indicate the most likely state transitions.
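The activity a(xi, R) can be computed directly from the decoded states. In the sketch below, `decode_states` stands in for the max-calibration step (it is assumed to return the most likely state x*(v) for each node), and the leaf bounding box is assumed to be stored in a `box` attribute of the illustrative Node container used earlier.

```python
from collections import Counter

def activation_maps(root, n_states, decode_states):
    """a(x_i, R) = |{v : x*(v) = x_i and v dominates R}| for every leaf region R.
    `decode_states` maps each node to its decoded (most likely) state index."""
    x_star = decode_states(root)                 # dict: node -> decoded state
    maps = []                                    # (leaf_box, per-state counts)

    def walk(node, path):
        path = path + [node]
        if not node.children:                    # leaf: region R is the leaf's box
            counts = Counter(x_star[v] for v in path)
            maps.append((node.box, [counts.get(i, 0) for i in range(n_states)]))
        for child in node.children:
            walk(child, path)

    walk(root, [])
    return maps
```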

5 Conclusions

In this paper, we have presented a new document categorization system in which documents are represented by XY-trees. The documents are classified by generalized hidden Markov models that are able to deal with labeled trees. Experiments have been carried out on a dataset of commercial invoices with very promising results. We have shown that the model learns a high-level semantic task (labeling individual fields of the invoices) in order to achieve the final document recognition. This is the result of combining the XY-tree representation (which can effectively isolate and model the semantic relationships among document constituents) with the HTMM capability of learning hierarchical structure in the data.

References

[1] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S. Marinai, and G. Soda, "Automatic document classification and indexing in high-volume applications," International Journal on Document Analysis and Recognition, vol. 4, no. 2, pp. 69–83, 2002.

[2] Y. Bengio and P. Frasconi, "An input output HMM architecture," in Advances in Neural Information Processing Systems 7 (G. Tesauro, D. Touretzky, and T. Leen, eds.), pp. 427–434, The MIT Press, 1995.

[3] R. Brugger, A. Zramdini, and R. Ingold, "Modeling documents for structure recognition using generalized n-grams," in Proceedings of ICDAR, 1997.

[4] F. Cesarini, M. Gori, S. Marinai, and G. Soda, "Structured document segmentation and representation by the modified X-Y tree," in Proceedings of the International Conference on Document Analysis and Recognition, pp. 563–566, 1999.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum-likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B, vol. 39, pp. 1–38, 1977.

[6] A. Dengel, "Initial learning of document structure," in Proceedings of ICDAR, pp. 86–90, 1993.

[7] A. Dengel and F. Dubiel, "Clustering and classification of document structure: A machine learning approach," in Proceedings of ICDAR, pp. 587–591, 1995.

[8] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. 13th Int. Joint Conf. on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, 1993.

[9] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," IEEE Trans. on Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.

[10] R. C. Gonzalez and M. G. Thomason, Syntactic Pattern Recognition. Reading, Massachusetts: Addison-Wesley, 1978.

[11] D. Heckerman, "Bayesian networks for data mining," Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 79–120, 1997.

[12] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen, "Bayesian updating in recursive graphical models by local computations," Computational Statistical Quarterly, vol. 4, pp. 269–282, 1990.

[13] F. Jensen, An Introduction to Bayesian Networks. New York: Springer-Verlag, 1996.

[14] M. I. Jordan, Z. Ghahramani, and L. K. Saul, "Hidden Markov decision trees," in Advances in Neural Information Processing Systems (M. C. Mozer, M. I. Jordan, and T. Petsche, eds.), p. 501, The MIT Press, 1997.

[15] H. Lucke, "Bayesian belief networks as a tool for stochastic parsing," Speech Communication, vol. 16, pp. 89–118, 1995.

[16] G. Nagy and M. Viswanathan, "Dual representation of segmented and technical documents," in Proceedings of the First International Conference on Document Analysis and Recognition, pp. 141–151, 1991.

[17] G. Nagy and S. Seth, "Hierarchical representation of optically scanned documents," in Proc. Int. Conf. on Pattern Recognition, pp. 347–349, 1984.

[18] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[19] C. Shin and D. Doermann, "Classification of document page images based on visual similarity of layout structures," in Proc. SPIE Document Recognition and Retrieval VII, vol. 3967, pp. 182–190, 2000.

[20] C. Shin, D. Doermann, and A. Rosenfeld, "Classification of document pages using structure-based features," International Journal on Document Analysis and Recognition, vol. 3, no. 4, pp. 232–247, 2001.

[21] P. Smyth, D. Heckerman, and M. I. Jordan, "Probabilistic independence networks for hidden Markov probability models," Neural Computation, vol. 9, no. 2, pp. 227–269, 1997.
