Recognition Of Urdu Script

Recognition of Printed Urdu Script

U. Pal and Anirban Sarkar
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute, 203 B. T. Road, Kolkata-108, India
Email: [email protected]

Abstract

This paper deals with an Optical Character Recognition system for printed Urdu, a popular Indian script. The development of OCR for this script is difficult because (i) a large number of characters have to be recognized and (ii) there are many similar-shaped characters. In the proposed system individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust. A prototype of the system has been tested on printed Urdu characters and currently achieves 97.8% character level accuracy on average.

1. Introduction

The subject of optical character recognition has received considerable attention in recent years. Several methods for recognition of Latin, Chinese, and Arabic scripts have been proposed [1,4,7,8]. Among Indian scripts, some pioneering work has been done on Bangla [2], Devnagari [10,12] and Oriya [3] scripts, and OCR systems for these scripts are ready for commercialization. Some studies have also been reported on Tamil [4], Telugu [9] and Gurmukhi [5] scripts. However, to the best of our knowledge, no work has been done on Urdu script. In this paper, we are concerned with the recognition of printed Urdu script. In the proposed system, the document image is captured using a flatbed scanner and passed through skew correction, line segmentation and character segmentation modules. These modules have been developed by combining conventional and newly proposed techniques. Next, individual characters are recognized by a combination of topological features, contour based features and features obtained from the concept of a water reservoir.

2. Properties of Urdu script

In India there are twelve scripts, and Urdu is one of the popular ones. Here we describe some properties of the Urdu script that are useful for building the OCR system. The modern Urdu alphabet consists of 39 basic characters. These characters are shown in Fig.1(a). Urdu has 10 numerals, which are shown in Fig.1(b). As in other Indian scripts, two or more Urdu characters may combine to form a complex shape called a compound character. Examples of some compound characters are shown in Fig.2. Also, the basic shape of a character may change depending on its position (first, middle or last) in a word. For example, Fig.3 shows an Urdu basic character in its isolated form together with its shapes in the first, middle and last positions of a word. As a result, the total number of characters to be recognized is very large. Thus, OCR development for Urdu is more difficult than for any European language script, which has a smaller number of characters.

Fig.1. Examples of Urdu alphabet and numerals: (a) basic characters of the Urdu alphabet, numbered 1 to 39; (b) Urdu numerals.

Urdu script has some characteristics that differ from those of other Indian scripts. Urdu is written from right to left, whereas other Indian scripts are written from left to right. An Urdu basic character may have as many as four components (see characters 6, 8, 17, 19, etc. of Fig.1(a)), a property that is rare in other Indian scripts. There is a structural similarity between the Urdu and Arabic scripts. Urdu is written in different styles, such as Naskh, Nastaliq, Aswad, Batool and Jaben. Here we consider the Naskh and Nastaliq styles.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03) 0-7695-1960-1/03 $17.00 © 2003 IEEE

Fig.2. Some examples of Urdu compound characters.

Fig.3. An isolated basic character and its shapes in the first, middle and last positions in a word.

3. Water Reservoir Principle

The water reservoir principle is as follows. If water is poured from one side of a component, the cavity regions of the component where water is stored are considered as reservoirs [11]. By top (bottom) reservoirs we mean the reservoirs obtained when water is poured from the top (bottom) of the component. (A bottom reservoir of a component can be visualized as a top reservoir when water is poured from the top after rotating the component by 180°.) Similarly, if water is poured from the left (right) side of the component, the cavity regions where water is stored are considered as left (right) reservoirs. For an illustration see Fig.4, where the top, bottom, left and right reservoirs of some Urdu characters are shown. The water flow direction from a full reservoir is also shown in this figure.

Fig.4. Different reservoirs (from top, left, bottom and right) and their water flow directions are shown in four characters. Water flow directions are shown by dotted arrows.

Not all reservoirs obtained from a given direction of a component are considered for further processing. Only reservoirs whose height is greater than a threshold T1 are considered. The value of T1 is chosen as 2/5 of the corresponding component height. This threshold value was obtained experimentally.

4. Skew detection and correction

The digitised text images are first converted into two-tone images using a histogram based thresholding approach. Here we represent object pixels by 1 and background pixels by 0. The two-tone image generally shows protrusions and dents in the characters as well as isolated object pixels over the background, which are cleaned by a logical smoothing approach [2]. Casual use of the scanner may lead to skew in the document image. The skew angle is the angle that the text lines of the document image make with the horizontal direction. Skew correction can be achieved by (i) estimating the skew angle, and (ii) rotating the image by the skew angle in the opposite direction.

Fig.5. (a) Example of a skewed Urdu text. (b) Candidate points for the Hough transform.

In this work, we use a Hough transform based technique for skew angle estimation. To reduce the amount of data to be processed by the Hough transform, we compute candidate points from some selected components of the image. For component selection, the mean width b_m of the bounding boxes of the connected components is computed, and components having a bounding box width greater than 0.5 × b_m are selected. Thus, small and irrelevant components like dots, punctuation marks, small modifiers, etc. are mostly filtered out of the skew estimation process. Let I be the image containing only the selected components; let B be the set of lowermost points of the top reservoirs obtained from the selected components of I; let I′ be the image I rotated anti-clockwise by 90°; and let B′ be the set of lowermost points of the top reservoirs of those components of I′ for which no top reservoir was obtained before rotation. Let L = B ∪ B′. Then L is the set of candidate points for the Hough transform. Candidate points are chosen in such a way that they lie, more or less, on parallel straight lines, and hence these points are good representatives for the Hough transform. A skewed Urdu text is shown in Fig.5(a), and the candidate points for the Hough transform of this text are shown in Fig.5(b). From Fig.5(b) it can be seen that most of the candidate points of a text line lie on a straight line. For skew angle detection, the usual Hough transform is applied to these candidate points. After skew angle detection the image is rotated according to the detected skew angle. It has been noted that font style and size variations do not affect this skew estimation method. Also, the proposed method can handle documents with skew angles between +45° and -45°. From our experiments it is observed that in about 97.4% of the cases our method computes the skew angle within a tolerance of ±0.5°.
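The component selection rule and the angle search can be illustrated with a minimal sketch. It assumes the candidate points are already available as (x, y) pairs, and it scores each trial angle by the sum of squared accumulator counts, which is one common way to pick the dominant direction and is not necessarily the peak criterion used in this work. The function names, the 0.1° angle step and the 1-pixel accumulator bin are illustrative assumptions; the sign of the returned angle depends on the image coordinate convention.

```python
import numpy as np

def select_components(boxes):
    """Keep bounding boxes (x0, y0, x1, y1) wider than 0.5 * mean width b_m."""
    widths = np.array([x1 - x0 + 1 for x0, _, x1, _ in boxes], dtype=float)
    b_m = widths.mean()
    return [box for box, wd in zip(boxes, widths) if wd > 0.5 * b_m]

def hough_skew_angle(points, angle_range=45.0, angle_step=0.1, rho_step=1.0):
    """Estimate the common slope angle (degrees) of the lines through `points`.

    For each trial angle, every point votes for the offset (rho) of the line
    of that slope passing through it. At the true skew angle the votes of
    one text line fall into a single accumulator cell, so the sum of squared
    counts is largest there. The scoring rule is an assumption of this sketch.
    """
    pts = np.asarray(points, dtype=float)
    angles = np.arange(-angle_range, angle_range + angle_step / 2, angle_step)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        t = np.deg2rad(a)
        # Offset of the line with slope angle `a` through each point.
        rho = pts[:, 1] * np.cos(t) - pts[:, 0] * np.sin(t)
        bins = np.round(rho / rho_step).astype(int)
        counts = np.bincount(bins - bins.min())
        score = float(np.sum(counts.astype(float) ** 2))
        if score > best_score:
            best_angle, best_score = float(a), score
    return best_angle
```

With points lying on a few parallel lines, the returned angle should fall within one or two angle steps of the true slope; a finer `angle_step` trades speed for resolution.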

5. Line and character segmentation The proposed OCR system automatically detects individual text lines and then segments the characters in each line. We do not segment words from a line for the recognition purpose. The lines of a text block are segmented by finding the valleys of the projection profile computed by counting the number of black pixels in each row. The trough between two consecutive peaks in this profile denotes the boundary between two text lines. A text line can be found between two consecutive boundary lines. An Urdu text with its projection profile is shown in Fig.6. Line segmentations are shown by dotted lines in this figure.
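A minimal sketch of line segmentation from the horizontal projection profile follows. It simplifies the valley detection by treating every row whose object-pixel count falls below a small threshold as part of a valley; the function name and the `min_pixels` parameter are assumptions made for illustration, and closely spaced or overlapping lines would need a more careful valley search.

```python
import numpy as np

def segment_lines(binary, min_pixels=1):
    """Return (top_row, bottom_row) pairs of text lines in a 0/1 image.

    The projection profile counts object pixels in each row. Rows with
    fewer than `min_pixels` object pixels are treated as valley rows that
    separate consecutive text lines.
    """
    profile = np.asarray(binary).sum(axis=1)
    is_text = profile >= min_pixels
    lines, start = [], None
    for r, flag in enumerate(is_text):
        if flag and start is None:
            start = r                      # a text line begins
        elif not flag and start is not None:
            lines.append((start, r - 1))   # a valley row ends the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(is_text) - 1))
    return lines
```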

Fig.6. Horizontal projection profile of an Urdu text and its line segmentations are shown.
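The character segmentation step of this section builds a vertical scanning histogram in which columns with two or fewer object pixels are set to 0, and places a boundary at the midpoint of every run of at least K1 zeros. The sketch below illustrates only that projection-profile part. The value of K1 is not given in the paper, so the default `k1=3` is a placeholder assumption, and the function name is hypothetical.

```python
import numpy as np

def character_boundaries(line_img, k1=3, noise_pixels=2):
    """Return (histogram, boundaries) for one 0/1 text-line image.

    A column with `noise_pixels` or fewer object pixels is recorded as 0,
    otherwise as its object-pixel count. The midpoint of each run of at
    least `k1` consecutive zeros is reported as a character boundary.
    """
    col_counts = np.asarray(line_img).sum(axis=0)
    hist = np.where(col_counts <= noise_pixels, 0, col_counts)
    boundaries, run_start = [], None
    for x, v in enumerate(hist):
        if v == 0:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if x - run_start >= k1:
                boundaries.append((run_start + x - 1) // 2)
            run_start = None
    # A zero run that reaches the right edge of the line image.
    if run_start is not None and len(hist) - run_start >= k1:
        boundaries.append((run_start + len(hist) - 1) // 2)
    return hist, boundaries
```

Kerned characters leave no zero run between them, which is why the projection-profile rule alone is not sufficient and component labeling is needed as a second step.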

Character segmentation is done by a combination of component labeling and vertical projection profile methods. A text line is scanned vertically. If two or fewer object pixels are encountered in a vertical scan, the scan is denoted by 0; otherwise it is denoted by the number of object pixels in that column. In this way a vertical scanning histogram is constructed. Now, if the histogram contains a run of at least K1 consecutive 0's, the midpoint of that run is considered as the boundary of a character. The value of K1 is determined experimentally. Sometimes, because of the kerned behavior [6] of Urdu script (kerned characters are characters that overlap with neighbouring characters), some characters of a line may not be segmented properly by the projection profile method. To take care of such cases, we apply a component labeling approach along with the projection profile method. If the distance between two consecutive character boundaries is large, we suspect a mis-segmentation at this position and use component labeling for further segmentation. During component labeling we check the vertical overlap of the components. If two or more components fully overlap vertically, we assume that they are different parts of one character. If the horizontal overlap between the bounding boxes of two consecutive components is less than 35%, we treat the two components as parts of two different characters.

6. Feature selection

For the initial classification of characters, we consider topological features, contour features and features obtained from the concept of water reservoirs. The topological features used include the existence of holes (loops) and their number, the position of holes with respect to the character bounding box, the ratio of hole height to character height, the number of different components in a character, etc. Contour features include characteristics of different profiles obtained from a portion of the character's contour. The main water reservoir based features used in the recognition scheme are (a) the number of reservoirs from different sides of a component, (b) the position of a reservoir with respect to its character bounding box, (c) the height of a reservoir, (d) the water flow level of a reservoir, (e) the direction of water overflow, (f) the ratio of reservoir height to component height, etc. These features are used to design a tree classifier where the decision at each node of the tree is taken on the basis of the presence or absence of a particular feature. The features considered here are simple and easy to detect. They are fairly robust to noise and stable with respect to font and style variations. For the detection of holes a background component labelling technique is used. The width and height of these holes are also measured in units of text line height.

7. Character recognition

In the present paper we recognize basic characters and numerals of Urdu script. We do not consider the recognition of compound characters here. The characters are recognized in two stages. In the first stage, the characters are grouped into a few subsets by a feature based tree classifier. In the second stage, we use more sophisticated features to recognize the similar characters belonging to the leaf nodes of the classification tree.

The design of a tree classifier has three components: (1) a tree skeleton or hierarchical ordering of the class labels, (2) the choice of features at each non-terminal node, and (3) the decision rule at each non-terminal node. Our tree is a binary tree, where the number of descendants of a non-terminal node is two. While traversing the tree, only one feature is tested at each non-terminal node. To choose a feature at a particular non-terminal node, we consider two points: (a) the feature should be robust, and (b) the feature should maximally divide the characters of the node into two groups, so as to get an optimal tree. For a given non-terminal node, we select a feature that best separates the group of patterns in the above sense. A part of the tree classifier used for the recognition of basic characters and numerals is shown in Fig.7. The features used in the tree classifier are mainly based on the concept of water reservoirs. For example, in some nodes we check whether any reservoir exists from the top of a character. Similarly, reservoirs from the bottom and left are also checked in some nodes to classify some characters. The use of other reservoir based features can be noted from the classification tree. From the tree it may be noted that some characters belong to both sub-groups of a node. This is done to reduce the errors that may occur because of poor document quality.

Most leaf nodes of the tree point to a single character. Some leaf nodes of the classification tree contain two similar-shaped (confusion) characters. For example, see Fig.8, where the character pairs of four different leaf nodes are shown. Here we discuss the classification techniques for the characters of some leaf nodes.

Consider, for example, the two characters of Fig.8(a), which belong to the same node of the tree. These two characters reach the same node because the water flow level of the reservoir of both characters coincides with the top of the character's bounding box. To separate them, we divide each of the characters into two parts at the lowermost point of its reservoir. The segmented parts are shown in Fig.9(a). For the character marked as N2 in Fig.8(a), the right segmented part is smaller than the left segmented part, whereas for the character marked as N6 in Fig.8(a) the right segmented part is bigger than the left segmented part.

To separate the two characters shown in Fig.8(b), we check whether one component lies inside the bounding box of the other component of the character. For example, see Fig.9(b). For the character marked as 8 in Fig.8(b), a smaller component lies fully inside the bounding box of the biggest component, while this is not true for the character marked as 3 in Fig.8(b).

To separate the two characters of Fig.8(c), we use the water flow direction of the reservoir, which differs between the two characters. The water flow direction is left for the character marked as 18 in Fig.8(c), whereas it is right for the other character. The water flow directions of these two characters are shown in Fig.9(c).
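The reservoir tests used in these nodes (existence of a top reservoir, its height relative to T1 = 2/5 of the component height, its lowermost point, water flow level and overflow direction) can be sketched as follows. This is a simplified approximation and not the procedure of [11]: it looks only at the top profile of a cropped connected component, so overhanging strokes are ignored, and it assumes every column of the crop contains an object pixel. The function and field names are hypothetical.

```python
import numpy as np

def top_reservoirs(component, t1_ratio=2 / 5):
    """Approximate the top reservoirs of a cropped 0/1 connected component.

    Water poured from the top is held above the top profile up to the lower
    of the highest walls on its left and right. Only reservoirs taller than
    t1_ratio * component height are returned.
    """
    comp = np.asarray(component).astype(bool)
    h, w = comp.shape
    if not comp.any(axis=0).all():
        raise ValueError("every column of the crop must contain an object pixel")
    # Height of the topmost object pixel in each column, from the bottom row.
    surface = h - comp.argmax(axis=0)
    left_wall = np.maximum.accumulate(surface)
    right_wall = np.maximum.accumulate(surface[::-1])[::-1]
    level = np.minimum(left_wall, right_wall)   # water flow level per column
    depth = level - surface                     # stored water per column
    reservoirs, x = [], 0
    while x < w:
        if depth[x] == 0:
            x += 1
            continue
        start = x
        while x < w and depth[x] > 0:
            x += 1
        height = int(depth[start:x].max())
        if height <= t1_ratio * h:              # threshold T1
            continue
        deepest = start + int(depth[start:x].argmax())
        if left_wall[start] < right_wall[start]:
            overflow = "left"                   # the left rim is lower
        elif left_wall[start] > right_wall[start]:
            overflow = "right"
        else:
            overflow = "both"
        reservoirs.append({
            "columns": (start, x - 1),
            "height": height,
            "flow_level": int(level[start]),
            "lowermost": (int(h - surface[deepest] - 1), deepest),  # (row, col)
            "overflow": overflow,
        })
    return reservoirs
```

In this representation, a flow level equal to the component height corresponds to the case where the water flow level coincides with the top of the bounding box, and the `lowermost` column gives the position at which a character could be split into left and right parts. Bottom, left and right reservoirs can be obtained by rotating the component before calling the function.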

Fig.7. A portion of the tree classifier for Urdu character recognition. The numbers in the tree represent the characters as numbered in Fig.1.

Fig.8. Examples of similar-shaped characters in four leaf nodes of the classification tree, panels (a) to (d).

Fig.9. Recognition techniques of similar-shaped characters, panels (a) to (d).

The characters shown in Fig.8(d) are very similar. The only difference is a small extra cavity in the character marked as 38 in Fig.8(d). To separate them we trace some of their border pixels. For an illustration see Fig.9(d). Starting from the topmost right pixel of the character, the border is traced in the clockwise direction until the lowermost point of the character is reached. During border tracing we calculate the distance of each traced pixel from the right side of the character's bounding box. These characters are separated by examining this distance sequence. For example, in the character marked as 38 we get at least one transition point in the distance sequence, while for the character marked as 33 we do not get any transition point. By a transition we mean a change of the distance values from increasing mode to decreasing mode or vice versa. This technique has also been used for the recognition of some other similar characters. Because of the page limit of this conference, we do not discuss the recognition strategies for the characters of other nodes. For that purpose we used features like the number of holes and their positions, the number of horizontal crossings in a particular portion of a character, reservoir heights, the position of a reservoir with respect to its character bounding box, the ratio of reservoir height to component height, etc.
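The transition test on the distance sequence can be sketched in a few lines. The counting function follows the definition of a transition given above. The helper that produces the distances is a simplification: it uses the row-wise right profile of the character instead of true clockwise border tracing, which is an assumption of this sketch, and both function names are hypothetical.

```python
import numpy as np

def right_profile_distances(char_img):
    """Distance from the rightmost object pixel of each row to the right
    side of the bounding box, from the top row down (rows without object
    pixels are skipped). This stands in for border tracing in this sketch."""
    img = np.asarray(char_img)
    w = img.shape[1]
    distances = []
    for row in img:
        cols = np.flatnonzero(row)
        if cols.size:
            distances.append(int(w - 1 - cols[-1]))
    return distances

def count_transitions(distances):
    """Count changes from increasing to decreasing mode or vice versa."""
    transitions, mode = 0, 0          # mode: +1 increasing, -1 decreasing
    for prev, cur in zip(distances, distances[1:]):
        if cur == prev:
            continue                  # flat steps do not change the mode
        step = 1 if cur > prev else -1
        if mode != 0 and step != mode:
            transitions += 1
        mode = step
    return transitions
```

A character would then be assigned to the class with the extra cavity when `count_transitions` returns at least 1. On noisy scans a single-pixel jitter also produces a transition, so a minimum change in distance may be needed in practice.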

8. Results and discussion

The proposed OCR system was tested on a variety of printed Urdu documents. Some of the documents contain good-quality printing on clean paper, while others are of inferior printing and paper quality (e.g. a cheap alphabet book for children). In this section, we summarize the results of our experiments.

Line and character segmentation: Our system identifies individual text lines with an accuracy of 98.3%. Occasionally, when two adjacent text lines are close to each other, the lower part of the upper line has some overlap with the upper part of the lower line. In such situations, there is no clear valley between the two lines in the projection profile, and our system fails to detect the boundary between them. All such errors were confined to the inferior-quality documents. The character segmentation accuracy of the system is 96.9%. Most segmentation errors were caused by touching and compound characters. Some errors were also caused by overlapping characters.

Character recognition: We tested our recognition system on 3050 characters and found that the system recognizes basic characters and numerals with an accuracy of about 97.8%. The recognition errors can be grouped into two classes: (a) Segmentation errors: incorrectly segmented characters are inevitably recognized incorrectly. These errors contribute 0.7% to the total error rate. (b) Tree classification errors: though water reservoir based features are powerful and insensitive to font size and style variations, topological features may not always correctly distinguish two similar-shaped characters. Thus, a character may sometimes be mis-recognized as another character contained in the same leaf node of the classification tree. This class of errors contributes 1.1% to the overall error rate. The other errors mainly arise during the recognition of confusion characters in leaf nodes.

Conclusion

This paper describes a system for OCR of printed Urdu script. The recognition accuracy of our prototype is promising, but more work is needed. Our character segmentation method should be improved to handle the larger variety of characters that often occur in images obtained from inferior-quality documents. We also need to recognize compound characters to make it a complete OCR system. In general, the system needs to be tested and fine-tuned on a wider variety of images containing characters in diverse fonts and sizes.

References

[1] A. Amin, “Off-line Arabic character recognition: The state of the art”, Pattern Recognition, vol.31, pp.517-530, 1998.
[2] B. B. Chaudhuri and U. Pal, “A complete printed Bangla OCR system”, Pattern Recognition, vol.31, pp.531-549, 1998.
[3] B. B. Chaudhuri, U. Pal and M. Mitra, “Automatic Recognition of Printed Oriya Script”, Sadhana, vol.27, part 1, pp.23-34, 2002.
[4] V. K. Govindan and A. P. Shivaprasad, “Character recognition - a survey”, Pattern Recognition, vol.23, pp.671-683, 1990.
[5] G. Lehal and C. Singh, “A Gurumukhi script recognition system”, In Proc. 15th ICPR, vol.2, pp.557-560, 2000.
[6] Y. Lu, “Machine printed character segmentation - an overview”, Pattern Recognition, vol.28, pp.67-80, 1995.
[7] S. Mori, C. Y. Suen and K. Yamamoto, “Historical review of OCR research and development”, Proceedings of the IEEE, vol.80, pp.1029-1058, 1992.
[8] G. Nagy, “Chinese character recognition - A twenty five years retrospective”, In Proc. ICPR, pp.109-114, 1988.
[9] A. Negi, C. Bhagvati and B. Krishna, “An OCR system for Telugu”, In Proc. 6th ICDAR, pp.1110-1114, 2001.
[10] U. Pal and B. B. Chaudhuri, “Printed Devnagari script OCR system”, Vivek, vol.10, pp.12-24, 1997.
[11] U. Pal, A. Belaïd and Ch. Choisy, “Touching numeral segmentation using water reservoir concept”, Pattern Recognition Letters, vol.24, pp.261-272, 2003.
[12] R. M. K. Sinha, “Rule based contextual post processing for Devnagari text recognition”, Pattern Recognition, vol.20, pp.475-485, 1995.
