Document Analysis: philozophyly of Voronoi skeletization
spring semester 2008
Prénom Nom
Outline
Objectives Typology Processing chain Methodology Character recognition Word recognition An OCR experiment Conclusion
2
© Prof. Rolf Ingold
Objectives
Text recognition is the most advanced domain of document analysis It aims at analyzing images to extracting text, i.e. sequences of character codes (ASCII/Unicode)
3
© Prof. Rolf Ingold
Character recognition typology
Machine printed vs. handwritten text on-line vs. off-line handwriting Isolated characters, connected characters, cursive text Various Alphabet Western languages (Roman, Greek, Asian (Chinese, Japanese, Korean, Thai, ... Arabic alphabets Limited vocabulary only numbers, only uppercase text with/without diacritics/punctuation restricted vocabulary (city names, street names, ...) language contextual knowledge
4
© Prof. Rolf Ingold
Factor influencing performance
Variability in style single scriber, omni-scriber mono-font, multi-font, omni-font Geometrical variability in size in orientation (rotated) in transformations (slanted, perspective view, ...) Image resolution binary images, starting at 200 dpi grey-level images starting at 150 dpi Image quality degraded support (historical documents) acquisition conditions (bad illumination, optical aberration, noise, ...) 5
© Prof. Rolf Ingold
European languages
European text is characterized by limited set of characters (26 to ~100 classes) diacritics and punctuation isolated characters and cursive scripts left-to right and top-to-bottom writing large variety of fonts different handwriting styles
6
© Prof. Rolf Ingold
Arabic
Arabic text is characterized right-to-left writing limited set of characters context dependent glyphs connected characters diacritics justification by word stretching
7
© Prof. Rolf Ingold
Asian scripts
Asian text is characterized numerous scripts (hanzi, kanji, hanja, han tu, hiragana, katakana, ...) horizontal and vertical writing very large alphabets structured characters grid based layout
8
© Prof. Rolf Ingold
OCR Methodologies
OCR systems are very complex and combine several steps image segmentation into characters or isolated shapes preprocessing of hypothetical segmented characters size normalization, morphological filtering, thinning, ... feature extraction of isolated shapes global measures (width to height ratio, density, center of gravity, moments, ...) local properties (stroke thickness, ...) shape identification using a single classifier (neural network, support vector machine, ...) multiple classifiers and fusion methods word validation using contextual information
9
© Prof. Rolf Ingold
Character vs. word recognition
There is a paradox with cursive text or connected characters: character recognition supposes prior character segmentation character segmentation requires prior character recognition
Several approaches to bypass this paradox entire word modeling and recognition multiple hypothesis generation and testing combined character segmentation and recognition (HMM)
10
© Prof. Rolf Ingold
Processing chain line segmentation word segmentation character segmentation isolated character rec.
word recognition
normalization
normalization
feature/primitive extr.
feature/primitive extr.
identification
identification
post-analysis recognized word
11
© Prof. Rolf Ingold
Isolated character recognition
Isolated character recognition is applicable on high quality printed text on constrained handwriting (forms) The challenge is to take into account the variability of the class
Performance depends on size of alphabet (number of classes) image quality
12
© Prof. Rolf Ingold
Several classification strategies
Direct comparison with class model
Statistical pattern recognition using features
Structural pattern recognition using primitives
Hybrid approaches combining statistical and structural approaches
Use of multiple classifiers and fusion of their results
13
© Prof. Rolf Ingold
Comparison with class model or class samples
The unknown pattern is compared with one representative of each class (model) a similarity measure is returned decision is determined by most similar sample rejection may occur if similarity is above a threshold
14
© Prof. Rolf Ingold
Similarity measures
Hamming distance
Warping distance
15
© Prof. Rolf Ingold
Features for statistical approaches
Someimes preprocessing at image level is required smoothing size normalization stroke normalization skeletization ... Features are extracted horizontal and vertical projection profile central moments intersections with lines global transforms (Hough, Fourier, ...) local features (densities, moments, ...) ...
16
© Prof. Rolf Ingold
Central moments
Central moments are shift invariant properties defined as
µ pq = ∑x ∑y ( x − x) p ( y − y) q f ( x, y) with
x=
m10 m00
y=
m01 m00
mpq = ∑x ∑y x p y q f ( x, y)
They can be computed using the following formulas
µ00 = m00 µ10 = 0 µ01 = 0 µ20 = m20 − m10 x µ02 = m02 − m01 x
µ11 = m11 + m10 y 2 µ12 = m12 − m02 x − 2m11 y + 2m10 y 2 µ21 = m21 − m20 y − 2m11 x + 2m01 x 2 µ30 = m30 − 3m20 x + 2m10 x 2 µ03 = m03 − 3m02 y + 2m01 y 17
© Prof. Rolf Ingold
Primitives for structural approaches
Shapes are decomposed in strokes and several properties are extracted number of connected components number of holes number and relative position of singular points extremities connections crossings concavities, convexities ... These primitives are represented as strings trees graphs and used for comparison 18
© Prof. Rolf Ingold
Identification
For statistical approaches, different classifier are used discriminant functions kNN classifier multi-layer perceptron support ...
For statistical approaches use hierarchical classification string distances graph matching
19
© Prof. Rolf Ingold
Multilayer perceptron
Information is propagated throw a layered network of "neurons" decision is given by the highest activation on the output layer weights of connections are computed in a training phase
20
© Prof. Rolf Ingold
Hierarchical classification
Structural pattern recognition can be performed hierarchically
21
© Prof. Rolf Ingold
OCR Difficulties
The main sources of errors are variability of character shapes (special fonts, handwriting) image defects : noise and distortions broken or touching characters
shape similarity ("0" and "O", "1", "I" and "l", "5" and "S", ...) small shapes : punctuation, accents, superscripts ("er", "ème") special characters ("©", "½", "±", ...) or bullets
22
© Prof. Rolf Ingold
Word recognition
Text recognition at word level makes sense in case of restricted vocabulary for language driven approaches for knowledge based approaches for keyword spotting
Word recognition is typically used cursive scripts handwriting noisy text, difficult to segment low resolution text
23
© Prof. Rolf Ingold
Word Recognition specificities
Word recognition is more complex than character recognition usually the number of classes is much higher more features are needed Word recognition can take into account external information language based knowledge dictionary character frequencies, bigrams, trigrams, etc. structural constraints (security number, dates, ...) restricted vocabulary (e.g. city names, street names, ...) redundancy (e.g. zip codes and city names) Hidden Markov Models (HH) are suited for word recognition
24
© Prof. Rolf Ingold
Hidden Markov models (1)
Each class is modeled by a two stage stochastic process using hidden and visible states
A model λ=(A,B, π) is composed of A, the matrix of transition probabilities
aij = P( qt = x j | qt −1 = xi ) B, the matrix of observation probabilities
b j ( k ) = P( yk | qt = x j ) π, the vector of the initial state probabilities
π i = P(q0 = xi )
25
© Prof. Rolf Ingold
Hidden Markov models (2)
The probability of an observation can be computed using
P (O | λ ) = ∑ P (O, Q | λ ) = ∑ P (O | λ , Q) P (Q | λ ) Q
=
∑π
q
Q
bq1 (o1 )aq1q2 bq2 (o2 ) ... aqT −1qT bqT (oT )
q1 ,... qT
A pattern is assigned to the model with highest posterior probability (i.e, the model that best explains the pattern)
The parameters of the model (probabilities) are determined in a training phase using training samples
26
© Prof. Rolf Ingold
Post-analysis
Post-analysis is performed with the aim of validating / correcting character recognition based on dictionaries bigrams, trigrams confidence of character recognition (if available)
27
© Prof. Rolf Ingold
Character Recognition Performance
OCR (Optical Character Recognition) is the most mature technique of document analysis
For most applications, very high accuracy is required 99% recognition rate would generate 30-50 errors per page 99,9% – 99,99% is often requested
OCR systems may be designed for standard OCR-A, OCR-B fonts, specially designed for OCR mono-font recognition, specialized for typewriting omni-font or multi-font text recognition (the most popular) trainable systems, being tuned for specific fonts and styles
28
© Prof. Rolf Ingold
OCR experiment
One magazine page [Hebdo 18, 2007, Editorial] good quality printed text medium layout complexity
29
© Prof. Rolf Ingold
First Experiment (Standard MS-Office OCR)
Layout not understood Many OCR errors => Unusable !
30
© Prof. Rolf Ingold
Second Experiment (ReadIris)
Page layout correctly recognized Italic correctly detected including one false detection A few word segmentation errors Correct de-hyphenation A few OCR errors often as consequence of segmentation errors !
31
© Prof. Rolf Ingold
Conclusion on OCR technologies
Imperfect results in printed character recognition Recognition of uncontrolled handwriting not mature Practical problems with mathematics (symbols and formulas) special fonts or scripts logos => perfect document recognition is not achievable !
Some applications can deal with approximate results Recognition algorithms should be tuned to prefer rejections to errors Include manual correction in the processing step
32
© Prof. Rolf Ingold