Preprocessing and Structural Feature Extraction for a Multi-Fonts Arabic / Persian OCR
Mandana Kavianifar, Adnan Amin
School of Computer Science and Engineering, University of New South Wales, Sydney NSW 2052, Australia
Abstract
English and Chinese are languages that have attracted a great deal of interest from character recognition researchers. In contrast, research on character recognition for Arabic/Persian scripts faces major problems, mainly related to the unique characteristics of these two scripts: they are cursive, one character takes multiple shapes depending on its position in a word, and characters are connected along the baseline. The proposed work consists of three major phases. First, the text is digitized into a gray scale image using a 300-dpi scanner. Different preprocessing steps are then applied to the image file. In the next phase, the sub-words of all words are recognized and global features for each word are extracted. Contour tracing plays a very important role in the feature extraction phase.
1. Introduction
Character recognition, one of the most important fields of pattern recognition, has been the center of attention for researchers over the last four decades. The modern version of OCR appeared in the mid-1940s with the development of digital computers [1]. Since then, several character recognition systems for English, Chinese and Japanese characters have been proposed [12]-[14]. However, the development of character recognition systems for other languages such as Arabic and Persian has not progressed as quickly. Arabic is the official language of all countries in North Africa and most of the countries in the Middle East, and it is the sixth most commonly used language in the world. In addition, Farsi or Persian, with almost 90 million speakers (the official language of Iran, also spoken in Afghanistan), Kurdi, Pashto and Urdu are all written using modified Arabic alphabets [8], [2]. These figures support the view that the slow advancement of Arabic/Persian OCR is not because these languages are unimportant or no longer living; rather, technical difficulties due to the special characteristics of these scripts have caused the problem. These unique characteristics can be categorized as follows:
• Arabic/Persian script (handwritten or printed) is cursive, and letters are normally connected to each other along an imaginary line called the baseline. Hence, separating the characters from each other in this kind of script is a difficult task.
• Each letter in Arabic/Persian script can have two to four different shapes within a word (at the beginning, in the middle, at the end or stand-alone). Figure 1 shows these four different positions for the character “AIN”. This increases the number of character shapes to be recognized from 28 (32 for Persian) to 100 (114 for Persian).
• Arabic/Persian is written and read from right to left.
• Most letters have one, two or three dots that can be positioned above, below or inside them (Figure 2). Several characters share the same main shape and differ only in the number and position of their dots. Hamzah (zigzag) also has this characteristic. Dots and Hamzah are called “complementary characters”.
• An Arabic/Persian word is composed of a number of portions, which are named “sub-words” in this paper. A sub-word comprises one stand-alone character or a group of connected characters (see Figure 3).
Figure 1. Four different shapes of the character “AIN”. From left to right: stand-alone, at the end, in the middle and at the beginning.
Figure 2. Three different Arabic characters: (a) 1 dot below, (b) 1 dot in the middle, (c) 3 dots above the character.
Figure 3. Three Arabic words: (a) contains one sub-word with 4 characters, (b) contains three sub-words with 1, 1 and 2 characters respectively, (c) contains two sub-words with 2 and 2 characters.
Due to these major differences between Arabic/Persian and Latin or Chinese scripts, methods proposed for the latter are not suitable for the former. Research on this subject did not start until the early 1980s. IRAC, which was suggested by Amin et al. [3], used a structural classification method. IRAC II [4] was based on a segmentation technique. Badi and Shimura used the concept of contour tracing and the identification of the cursive components in their syntactic method [5]. In another method [6], sub-words are identified and separated in the text, and a histogram is then used to segment each sub-word. Mahmoud [7] adopted a combination of Fourier descriptors and contour tracing for Arabic characters. Contour tracing also plays a crucial role in the system proposed by Allam [2]. During the last five years, researchers have suggested several other methods; for a complete survey of the literature on Arabic OCR, the reader is referred to [9]. Sakhr Automatic reader no-3.01 and Shonut’s Omnipage Pro Version 2.0 are two examples of commercially available Arabic character recognition systems [17]. This paper describes a multi-font Arabic/Persian character recognition system and its major phases. Within the paper, “Arabic” refers to both Arabic and Persian unless explicitly stated otherwise.
2. The proposed recognition system
The proposed character recognition system for the Arabic character set is composed of the following phases:
2.1. Digitization
The first phase of this system is data acquisition, which is done by scanning an Arabic text at 300 dpi resolution. The input file for the system is a gray scale PGM-formatted file, so any other gray scale format is converted to PGM before being used as input. A gray scale image preserves more information about the original image than a binary image. In addition, noise can be introduced and features such as loops can be distorted in a binary version of an image file, so useful information that a gray scale file retains is corrupted.
2.2. Preprocessing
Preprocessing (pixel-level processing) of the input file makes it ready for further processing. This major phase includes the following steps:
2.2.1. Global thresholding. A suitable threshold among the potential thresholds is selected as a global threshold by employing Otsu’s method [15]. Pixels with a value greater than the global threshold are treated as background and the others as foreground pixels.
2.2.2. Connected components recognition. A connected component (cc) is represented by the rectangular box bounding a region of connected foreground pixels. The objective of this step is to form these rectangles around the distinct components of the input file [16].
2.2.3. Grouping. The next step is the grouping of neighboring connected components of similar dimension. The algorithm takes one cc at a time and tries to merge it into a group from the set of existing groups. If it succeeds, the group’s dimensions are altered to cater for the new cc. If the cc cannot be merged with any of the existing groups, a new group is formed with the cc as its sole member. Figure 4 shows an Arabic word, its connected components and its group.
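The thresholding and grouping steps of Sections 2.2.1 and 2.2.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the box layout (top, left, bottom, right) and the merge criteria (`max_gap`, `max_height_ratio`, overlapping rows) are assumptions made for illustration, since the paper does not state how "neighboring" and "similar dimension" are measured.

```python
def otsu_threshold(pixels, levels=256):
    """Otsu's method: pick t maximizing the between-class variance
    of the two classes {p <= t} and {p > t}."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image, t):
    """Pixels greater than t are background (0), the others foreground (1)."""
    return [[1 if p <= t else 0 for p in row] for row in image]


def merge(a, b):
    """Smallest box (top, left, bottom, right) containing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def can_join(group, cc, max_gap, max_height_ratio):
    """Hypothetical merge test: rows overlap, small horizontal gap, similar height."""
    g_top, g_left, g_bottom, g_right = group
    c_top, c_left, c_bottom, c_right = cc
    rows_overlap = c_top <= g_bottom and g_top <= c_bottom
    gap = max(c_left - g_right, g_left - c_right, 0)
    g_h = g_bottom - g_top + 1
    c_h = c_bottom - c_top + 1
    similar = max(g_h, c_h) <= max_height_ratio * min(g_h, c_h)
    return rows_overlap and gap <= max_gap and similar


def group_components(ccs, max_gap=3, max_height_ratio=2.0):
    """Take one cc at a time; merge it into the first group that accepts it,
    otherwise start a new group whose sole member is that cc."""
    groups = []
    for cc in ccs:
        for i, g in enumerate(groups):
            if can_join(g, cc, max_gap, max_height_ratio):
                groups[i] = merge(g, cc)  # group dimensions grow to cover the cc
                break
        else:
            groups.append(cc)
    return groups
```

In practice the gap and size thresholds would have to be tuned to the font size and scanning resolution; the defaults above are placeholders.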
Figure 4. (a) An Arabic word, (b) its connected components (sub-words), (c) its group.
2.2.4. Skew detection and correction. The adopted skew detection algorithm [11] attempts to determine the skew angle of the entire document by calculating a skew angle for each group. A skew correction algorithm is then applied to the input file.
2.3. Feature extraction
The proposed work adopts a global approach to character recognition, so no character segmentation is required.
2.3.1. Contour tracing. The basic step in determining the sub-words within each word is tracing the outer contours of all its elements. Within the boundaries of each group (representing a word), a raster scan from top to bottom and left to right proceeds until the first foreground pixel is reached. From this point, contour tracing begins, using the Freeman chain code and the Left-Most-Looking (LML) rule [7]. Two different types of contours are produced in this step: external and loop. An Arabic word and its contours are shown in Figure 5.
Figure 5. (a) An Arabic word, (b) its main body contours (external), (c) the contour of the upper dot (external), (d) the contour of the lower dot (external), (e) the contour of the loop (internal).
2.3.2. Contour analysis. In this step, all the sub-words of the word are first determined, and then each sub-word’s contours are analyzed and classified. Each sub-word can have three types of external contours: main body, complementary character or noise. Large noise components are another sort of contour that can be found in the image file. In addition, some spurious pixels may appear in the image file during the digitization phase. Fortunately, they are not large and can be recognized by their contour’s length. Two thresholds, TN and TM, are set to help determine a contour’s class. If the Contour’s Length (CL) is less than TN, the contour is considered noise and its class gets the value 0. If CL >= TM, the contour belongs to a main body and the value 1 is assigned to the field. If TN < CL < TM, a complementary character’s contour has been found and the field gets the value 2. At the end of this step, an array of information about the contours of each sub-word of the word is built. This array contains the contour’s type, the coordinates of the first pixel of the contour, the necessary information about each pixel of the contour, and the contour’s length and class.
2.3.3. Sub-words detection. A further analysis of the set of contours is carried out to find a main body, which determines a sub-word. To find the vertical boundaries of this contour, its two pixels with the largest and smallest Y-coordinates must be identified. This is done by traversing the linked list of information about all pixels of the contour and comparing their Y-coordinates with the current minimum and maximum.
2.3.4. Detecting a sub-word’s complementary characters. All the contours of complementary characters that belong to the detected sub-word should be found. Therefore, another search is made through the array to find all the external contours whose starting points have Y-coordinates that fall within the two boundaries. As mentioned before, dots can occur singly or in groups of 2 or 3. In some fonts, dots can be attached together and form a bigger island. To detect non-attached dots, the distances between the X and Y coordinates of the starting points of their contours are checked. If these distances are less than two given thresholds, the contours belong to non-attached dots and the number of contours determines the type of dots (2 or 3). To detect the type of attached dots (see Figure 6), three different methods have been designed and tested on all available Arabic fonts for Windows. These methods are based on comparing the length and area of the attached dots’ contour with different defined thresholds, and on comparing the chain-code sequence of the contour with certain patterns. The results show that the third method works best among the three. In fact, the third method enables the proposed system to work as an omni-font OCR for all the fonts available in the Arabic for Windows word processor.
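The length-based contour classes of Section 2.3.2 (thresholds TN and TM) and the search for a sub-word's complementary contours in Sections 2.3.3 and 2.3.4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Contour` class and function names are hypothetical, contour length is taken as the number of traced pixels, a length exactly equal to TN (which the text leaves unspecified) is treated as complementary, and X is assumed to be the row index and Y the column index, as the comparison of X with the baseline row in Section 2.3.4 suggests.

```python
from dataclasses import dataclass

# Class values used in Section 2.3.2
NOISE, MAIN_BODY, COMPLEMENTARY = 0, 1, 2


@dataclass
class Contour:
    points: list           # traced pixels as (x, y); x assumed row, y assumed column
    external: bool = True  # False for loop (internal) contours

    @property
    def length(self):
        # Contour length CL, taken here as the number of traced pixels (assumption)
        return len(self.points)

    @property
    def start(self):
        # First traced pixel of the contour
        return self.points[0]


def classify(contour, tn, tm):
    """Return the class of a contour from its length and the thresholds TN, TM."""
    cl = contour.length
    if cl < tn:
        return NOISE
    if cl >= tm:
        return MAIN_BODY
    # TN < CL < TM in the text; CL == TN is also placed here (assumption)
    return COMPLEMENTARY


def y_bounds(contour):
    """Smallest and largest Y-coordinate among the contour's pixels."""
    ys = [y for _, y in contour.points]
    return min(ys), max(ys)


def complementary_for(main_body, contours, tn, tm):
    """External complementary contours whose starting point's Y-coordinate
    lies within the boundaries of the given main body."""
    lo, hi = y_bounds(main_body)
    return [
        c for c in contours
        if c is not main_body
        and c.external
        and classify(c, tn, tm) == COMPLEMENTARY
        and lo <= c.start[1] <= hi
    ]
```

For example, with TN = 2 and TM = 20, a 30-pixel contour is a main body, a 4-pixel contour starting between the main body's boundaries is one of its complementary characters, and a single stray pixel is discarded as noise.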
Figure 6. (a), (b) Two Arabic words, each with one group of 3 attached dots and one group of 2 attached dots.
Dots in Arabic characters can be positioned under, above or in the middle of them. Because the position of dots with respect to the baseline determines their position with respect to the character, correct detection of the baseline is important. Two different methods have been suggested and tested for baseline detection:
1. Baseline detection using the contour’s chain code [10].
2. Baseline detection using the horizontal projection profile and finding the row with the largest density.
The test results show that the second method is more successful than the first. The position of the dots is then determined by comparing the X coordinate of the contour’s starting point with the baseline location.
2.3.5. Producing output file. At the end of the feature extraction step, an output file is produced that contains all the information about every word in the image file. Each line contains the following information: name of the word, number of sub-words, number of peaks in the histogram of the sub-word, type and position of complementary characters, and number of loops. The last four fields are repeated for every sub-word of the word.
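The second baseline method of Section 2.3.4 (horizontal projection profile) and the subsequent positioning of dots can be sketched in Python as follows. The function names and the `tolerance` parameter are hypothetical, and the sketch assumes a binary image stored as a list of rows with 1 for foreground pixels and row indices growing downward. The paper does not say how a dot "in the middle" is distinguished, so the tolerance band around the baseline is an assumption.

```python
def detect_baseline(binary):
    """Return the index of the row with the largest number of foreground pixels.
    binary is a list of rows; each pixel is 1 (foreground) or 0 (background)."""
    profile = [sum(row) for row in binary]  # horizontal projection profile
    return max(range(len(profile)), key=profile.__getitem__)


def dot_position(start_row, baseline_row, tolerance=0):
    """Position of a dot relative to the baseline, from the row (X coordinate)
    of its contour's starting point. Rows are counted from the top."""
    if start_row < baseline_row - tolerance:
        return "above"
    if start_row > baseline_row + tolerance:
        return "below"
    return "middle"  # within the tolerance band (assumed criterion)


if __name__ == "__main__":
    word = [
        [0, 0, 1, 0, 0],  # a dot above the main body
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],  # densest row: the baseline
        [0, 1, 0, 0, 0],
    ]
    baseline = detect_baseline(word)
    print(baseline, dot_position(0, baseline), dot_position(3, baseline))
```

When several rows share the maximum density, this sketch returns the topmost one; a real system might prefer the center of the densest band.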
3. Experimental Results and Conclusion
The proposed work has been tested on printed Arabic words in 14 of the 19 different fonts available in Arabic for Windows (including well-known fonts such as Naskh, Andalus, Kufi and Traditional Arabic). The system has been developed in Microsoft Visual C++ on the Windows NT platform. Table 1 shows the results of the feature extraction phase on several sample Arabic words.

Name of Entity    % Correctly Recognized
Sub-word          97%
One dot           ~91%
Two dots          ~78%
Three dots        ~63%
Loop              100%

Table 1. Test results of the feature extraction phase
Some Arabic fonts and other complex features of Arabic script have caused poor results for the recognition of 2 and 3 dots. In these fonts, the contour length and area of 2 or 3 dots differ from those in most other fonts, so finding correct threshold values is difficult and in some cases impossible. Another reason for the rather poor recognition of dots is that dots are attached to main bodies in some fonts. In these cases no contours are traced for the dots, so they are not recognized as dots at all. The system performs excellently on mono-font input in the feature extraction phase, because the problem of dots attached together and/or to the main body does not occur. The behavior of the system’s algorithms is also predictable on noisy documents. The system recognizes all the loops present in the words.
References
[1] V. K. Govindan and A. P. Shivaprasad, “Character Recognition – A Review”, Pattern Recognition, Vol. 23, No. 7, pp. 671-683, 1990
[2] M. Allam, “Segmentation versus Segmentation-free for Recognizing Arabic Text”, Document Recognition II, SPIE, Vol. 2422, pp. 228-235, 1995
[3] A. Amin, A. Kaced, J. P. Haton and R. Mohr, “Handwritten Arabic Character Recognition by the IRAC system”, Proceedings of the 5th Int. Conference on Pattern Recognition, pp. 729-731, 1980
[4] A. Amin, “Machine recognition of hand written Arabic words by the IRAC II system”, Proceedings of the 6th Int. Joint Conference on Pattern Recognition, pp. 34-36, 1982
[5] K. Badi and M. Shimura, “Machine recognition of Arabic cursive scripts”, Pattern Recognition in Practice, Amsterdam: North Holland, 1980
[6] A. Amin and G. Masini, “Machine recognition of multi-font printed Arabic texts”, Proceedings of the 8th International Conference on Pattern Recognition, pp. 392-395, 1986
[7] S. A. Mahmoud, “Arabic character recognition using Fourier descriptors and character contour encoding”, Pattern Recognition, Vol. 27, No. 6, pp. 815-824, 1994
[8] M. R. Hashemi, O. Fatemi and R. Safavi, “Persian script recognition”, Proceedings of the Third Int. Conference on Document Analysis and Recognition, Vol. II, pp. 869-873, 1995
[9] A. Amin, “Off-line Arabic Character Recognition: The state of the art”, Pattern Recognition, Vol. 31, No. 5, pp. 517-530, 1998
[10] M. Kavianifar, “A Persian/Arabic Character Recognition System”, Proceedings of Pan-Sydney Area Workshop on Visual Information Processing, pp. 207-211, 1997
[11] A. Amin, S. Fischer, A. Parkinson and R. Shin, “Comparative Study of Skew Detection Algorithms”, Journal of Electronic Imaging 5 (4), pp. 443-451, 1996
[12] I. Sekita, K. Toraichi, R. Mori, K. Yamamoto and H. Yamada, “Feature extraction of handprinted Japanese characters by spline function or relaxation matching”, Pattern Recognition, Vol. 21, pp. 9-17, 1988
[13] X. L. Xie and M. Suk, “On machine recognition of handprinted Chinese characters by feature relaxation”, Pattern Recognition, Vol. 21, pp. 1-7, 1988
[14] H. Matsumura, K. Aoki, T. Iwahara, H. Oohama and K. Kogura, “Desktop optical handwritten character reader”, Sanyo Tech. Rev., Vol. 18, pp. 3-12, 1986
[15] N. Otsu, “A threshold selection method from gray level histogram”, IEEE Trans. Syst. Man Cyb., SMC-9(1), pp. 62-66, 1979
[16] A. Amin and M. Kavianifar, “Automatic recognition of printed Arabic text using neural network classifier”, Proc. of 9th Int. Conference in Image Analysis and Processing, Vol. II, pp. 616-623, 1997
[17] http://www.sakhrsoft.com, http://www.shonut.com