English To Urdu Translation System

Tafseer Ahmed ([email protected]), Sadaf Alvi ([email protected])
University of Karachi, 2002

Abstract

This paper describes a system for English to Urdu machine translation. English and Urdu differ in structure: in the order of words within phrases and in the order of phrases within clauses and sentences. In addition, words in Urdu change their forms with respect to gender, number and other attributes of the sentence. This paper identifies many such issues and suggests solutions to them. To design and implement the translation system, a grammar of the English language is developed, and sentences are parsed using the Bottom Up Chart Parsing algorithm. A Transformational Engine that converts the parse tree of an English sentence into an Urdu sentence is also developed.

1. Introduction

Machine translation systems are a need of today. The term 'machine translation' (MT) refers to computerized systems responsible for the production of translations with or without human assistance.[3] A huge amount of textual data is present in digital form, and this amount is increasing day by day. Most of this data is in the English language, but English is not the language of the whole world. A large percentage of the world's population either cannot read English or has little command of the language. Translators from English to other languages are therefore needed.

Urdu is the national language of Pakistan. It is also spoken in India and some other South Asian countries. Urdu belongs to the Aryan family of languages[2]. It has absorbed the influence of the Hindi, Arabic, Persian and Turkish languages.

Machine translation is considered one of the most difficult problems of Computer Science. This paper identifies the major problems in translation from English to Urdu and suggests solutions to them. It also describes the English to Urdu Translation System.

2. English and Urdu: Comparative Study

English and Urdu are languages which differ in many respects.
The major differences between English and Urdu are:

2.1 SVO vs SOV Language

English is classified as a Subject-Verb-Object (SVO) language because the subject, verb and object phrases of a sentence appear in this order. An example of an English sentence is

I          write    an essay
Subject    Verb     Object

In contrast, Urdu is a Subject-Object-Verb (SOV) language. An example of an Urdu sentence is

main(I)    aik mazmoon(an essay)    likhta hon(write)
Subject    Object                   Verb

2.2 Word Order in Phrase

Another difference in the structure of English and Urdu is that the order of words in certain phrases differs between the two languages. In an English prepositional phrase, the head word is not the last word of the phrase, but in Urdu the head word is the last word of the phrase. For example, in English

Book          of             Urdu
Noun(Head)    Preposition    Noun

But in Urdu, the phrase will be written as:

Urdu    ki(of)         kitab(book)
Noun    Preposition    Noun(Head)
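The two re-orderings described in Sections 2.1 and 2.2 can be sketched in a few lines of Python. This is only an illustration built on the examples above, not the system's implementation; the function names are assumptions, and the Urdu words are supplied by hand rather than looked up in a dictionary.

```python
def reorder_clause(subject, verb, obj):
    """Re-arrange an English Subject-Verb-Object clause into Urdu S-O-V order."""
    return [subject, obj, verb]


def reorder_prepositional_phrase(head, preposition, noun):
    """English 'head preposition noun' becomes Urdu 'noun preposition head'."""
    return [noun, preposition, head]


# Hand-written glosses taken from the examples in Sections 2.1 and 2.2.
clause = reorder_clause("main", "likhta hon", "aik mazmoon")
phrase = reorder_prepositional_phrase("kitab", "ki", "Urdu")

print(" ".join(clause))  # main aik mazmoon likhta hon
print(" ".join(phrase))  # Urdu ki kitab
```

Re-ordering alone is not sufficient: as the following sections show, the forms of the Urdu words must also be adjusted.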
2.3 Many Forms of Word

In every language, a word is used in many inflected forms depending upon the features of other words and phrases of the sentence. For example, a verb in English has five forms (base, 1st, 2nd, 3rd and 4th forms). If a sentence is in the Present Indefinite tense then the 1st form of the verb is used, but in sentences of continuous tense the 4th form of the verb is used.

Urdu has more inflected forms than English. In English, adjectives never change their form with respect to the noun they modify. For example, in Green Book, Green Books, Green Pen, Green Pens the same word "Green" is used irrespective of the number and gender of the noun. Urdu, however, has different inflectional forms of adjectives for different numbers and genders of the head noun. In Urdu, these phrases will be written as

Hari(Green) Kitab(Book)
Hari(Green) Kitabain(Books)
Hara(Green) Qalam(Pen)
Haray(Green) Qalam(Pens)

Similarly, prepositions and verbs have inflected forms in Urdu.

2.4 Nouns and Verbs have Gender in Urdu

English common nouns and verbs do not have genders associated with them. In Urdu, however, every noun and verb has a gender associated with it. For example, Book is used as feminine and Pen is used as masculine. This is the cause of the many forms of adjectives and prepositions used with different nouns (as in the example of Section 2.3). Similarly, verbs have gender. In some sentences, the inflectional form of the verb depends on its own gender:

aadmi(man) ne aurat(woman) se bat ki(talked).
aurat(woman) ne aadmi(man) se bat ki(talked).

In both sentences the verb takes the same form ki, whatever the gender of the subject.

2.5 Types of Tense

English and Urdu do not have a one-to-one correspondence in tense types, so Urdu does not necessarily have an equivalent of every tense of English grammar. For example, Urdu has more than one version of the Past Indefinite tense, such as Absolute Past, Near Past and Distant Past [2], and an English sentence can be translated into any one of these Urdu tenses. "He bought a book" can be translated as

us ne kitab kharidi hay.
us ne kitab kharidi thi.
us ne kitab kharidi.

3. Translator System

The system is divided into three major parts: the (1) Lexical Module, which converts input text into tokens with features; the (2) Syntactical Module, which generates a Parse Structure from the tokens obtained from the Lexical Module; and the (3) Transformational Module, which transforms the English Parse Structure (generated by the Syntactical Module) into an Urdu sentence.

3.1 Lexical Module

The function of the Lexical Module is to generate tokens with features corresponding to each word. This module is divided into two sub-modules.

(1) The Preprocessor sub-module takes the input as text and breaks it into words. It replaces short forms of words with the original words. For example, don't is replaced by do not. It also attaches flags to each word showing that it is a Proper Noun, Possessive Noun, etc.

(2) The Token Generator sub-module generates tokens with features corresponding to each word supplied by the Preprocessor sub-module.

Features are used to store attributes of a word. Usually a word's Part of Speech (Category), Root (Base or Root Form), and properties like number of noun, degree of adjective and person of pronoun are parts of the feature structure. For English to Urdu translation, a special purpose feature set is designed which has attributes to
deal with all identified issues of translation. Table 1 shows the list of features and their purpose.

Table 1: Features of an English Word

Feature: Description
Category: Part of speech of the word. Its permissible values are Noun, Pronoun, Determiner, Adjective, Verb, Adverb, Preposition, Conjunction and Punctuation.
SubCategory: Sub-category of the above feature. For example, category Pronoun has two subcategories: Subjective and Objective. Verb has two subcategories: Auxiliary and Main Verb.
Sense: Semantic category of a Noun. Permissible values are Human, Animate and Inanimate.
Form: When used with a Pronoun, it represents Person (1,2,3). With a Verb, it represents Verb Form (0,1,2,3,4) and with an Adjective, it represents Degree (1,2,3).
Number: Shows whether a Noun or Pronoun is Singular or Plural.
Gender: Gender of the word (Noun, Pronoun, Verb) in Urdu.
Subject Preposition: This feature is associated with Verbs. It holds the Urdu preposition that can be used after the Subject of the sentence (for example ne, ko).
Object Preposition: This feature is associated with Verbs. It holds the Urdu preposition that can be used after the Object of the sentence (for example ko, se).
Masculine Singular Form: This feature is associated with Nouns, Adjectives and Prepositions. It holds the inflected form used in the masculine singular.
Feminine Singular Form: This feature is associated with Nouns, Adjectives and Prepositions. It holds the inflected form used in the feminine singular.
Masculine Plural Form: This feature is associated with Nouns, Adjectives and Prepositions. It holds the inflected form used in the masculine plural.
Feminine Plural Form: This feature is associated with Nouns, Adjectives and Prepositions. It holds the inflected form used in the feminine plural.
Object Count: This feature is associated with Verbs. It holds the number of objects allowed after the verb in a sentence. For intransitive verbs, the value is 0; for verbs allowing double objects, the value is 2.
Urdu Meaning: The meaning of the word in Urdu (in root form).
Irregular Urdu Form: In the Transformational Module, inflected forms of Urdu words are derived by rules. For words with irregular forms, a table is consulted whose index is stored in this feature.

Some features of the above table are worth explaining. As discussed in Section 2.4, Urdu nouns and verbs have gender; this is why the Gender of the word in Urdu is stored as a feature. Similarly, as discussed in Section 2.3, Urdu adjectives, noun modifiers and prepositions may have inflected forms depending upon the gender and number of the head noun. However, there is no well-defined rule that determines whether a word, e.g. an adjective, has inflected forms or not. For example, the adjective hara (green) has inflected forms, but dana (wise) does not. For this reason, the inflected forms of all adjectives, prepositions and nouns are stored in the Masculine Singular, Feminine Singular, Masculine Plural and Feminine Plural Form features.

The Subject Preposition and Object Preposition features are associated with verbs. In some forms of sentences (for example, the Past Indefinite tense), a preposition (usually ne) is used after the Subject and a preposition (usually ko) is used after the Object. For example, "I told you" will be written in Urdu as

Main(I)    ne                     tum(you)    ko                    bataya(told)
Subject    Subject Preposition    Object      Object Preposition    Verb
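One possible way to picture the feature structure is as a small record type. The sketch below is a hypothetical Python rendering of a subset of the Table 1 features, not the data structure used by the system; the class name, field names and the helper arrange_past_indefinite are assumptions made for illustration. The inflected verb is passed in by hand because the inflection rules belong to the Transformational Module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """A word token carrying a subset of the features listed in Table 1."""
    word: str                                  # English surface word
    category: str                              # Noun, Pronoun, Verb, ...
    urdu_meaning: str                          # Urdu meaning in root form
    sub_category: Optional[str] = None         # e.g. Subjective, Main Verb
    form: Optional[int] = None                 # person, verb form or degree
    number: Optional[str] = None               # Singular or Plural
    gender: Optional[str] = None               # gender of the word in Urdu
    subject_preposition: Optional[str] = None  # verbs only, e.g. "ne"
    object_preposition: Optional[str] = None   # verbs only, e.g. "ko"


def arrange_past_indefinite(subject: Token, verb: Token, obj: Token,
                            inflected_verb: str) -> str:
    """Place the verb's subject and object prepositions in S-O-V order.

    The inflected verb is supplied by the caller; deriving it from the
    root form is outside the scope of this sketch.
    """
    words = [subject.urdu_meaning]
    if verb.subject_preposition:
        words.append(verb.subject_preposition)
    words.append(obj.urdu_meaning)
    if verb.object_preposition:
        words.append(verb.object_preposition)
    words.append(inflected_verb)
    return " ".join(words)


i = Token("I", "Pronoun", "main", sub_category="Subjective", form=1,
          number="Singular")
you = Token("you", "Pronoun", "tum", form=2)
told = Token("told", "Verb", "batana",  # root form assumed for illustration
             sub_category="Main Verb",
             subject_preposition="ne", object_preposition="ko")

print(arrange_past_indefinite(i, told, you, "bataya"))  # main ne tum ko bataya
```

Keeping the prepositions on the verb token, as Table 1 does, means the sentence-level code does not need to know which verbs take ne or ko.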
The Urdu Meaning of verbs is stored in un-inflected form. A rule based system in the Transformational Module inflects the verbs.

The Token Generator sub-module searches for each word (generated by the Preprocessor sub-module) in a database of words and generates one or more tokens corresponding to each word (because a word, e.g. "fly", can be categorized as both a noun and a verb). It also uses the flags to detect proper nouns and searches for proper nouns in a separate database of Proper Nouns.

The Lexical Module also groups words into sentences. When it finds a sentence separator in the list of generated tokens, it ends the previous sentence and starts a new one.

3.2 Syntactical Module

The Syntactical Module parses the sentence(s) (as collections of word tokens) coming from the Lexical Module. The token collection to be parsed is two-dimensional. The first dimension is the word position in the sentence and the second dimension holds the different tokens corresponding to the same word.

A Context Free Grammar for the English language is developed. For parsing, different techniques were studied. Among them, Bottom Up Chart Parsing [1] is selected for implementation. One reason for this selection is that it is more efficient than the other methods studied. But the real advantage of this technique is that it can be used to obtain partial parses of sentences. No grammar, however carefully designed, can parse all English sentences. A Bottom Up Chart Parser can give partial parses, and these partial parses can be translated to give some sense of the sentence if a complete parse of the sentence is not possible.

These Partial Parse Structures consist of sequences of non-overlapping word tokens. The algorithm is as follows: (1) If no complete parse is obtained, choose the Parse Structure of maximum length (number of tokens) from all partial parses. (2) Recursively find the parse structure with maximum length in the remaining parts of the sentence (the tokens that are not part of the maximum length parse obtained in the previous step). (3) Finally, give these partial parses as output in the order in which they occur in the sentence.

Some productions of the grammar developed for English are:

<Sentence>
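The partial-parse selection described in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: each partial parse is represented only by the span of tokens it covers, the names PartialParse and select_partial_parses are hypothetical, ties between equally long parses are broken by taking the first one found, and tokens covered by no partial parse are simply skipped.

```python
from typing import List, NamedTuple


class PartialParse(NamedTuple):
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str   # category at the root of the partial parse


def select_partial_parses(parses: List[PartialParse],
                          lo: int, hi: int) -> List[PartialParse]:
    """Choose non-overlapping partial parses covering tokens lo..hi-1.

    The longest parse inside the range is taken first; the same choice is
    then made recursively to its left and to its right, and the results
    are returned in sentence order.
    """
    candidates = [p for p in parses
                  if lo <= p.start < p.end <= hi]
    if not candidates:
        return []
    best = max(candidates, key=lambda p: p.end - p.start)
    left = select_partial_parses(parses, lo, best.start)
    right = select_partial_parses(parses, best.end, hi)
    return left + [best] + right


# Hypothetical chart contents for a seven-token sentence.
chart = [
    PartialParse(0, 2, "NP"),
    PartialParse(1, 5, "VP"),
    PartialParse(0, 1, "Det"),
    PartialParse(5, 7, "PP"),
    PartialParse(4, 7, "NP"),
]

for parse in select_partial_parses(chart, 0, 7):
    print(parse.label, parse.start, parse.end)
```

In this example the four-token parse is chosen first, so the parses that overlap it are discarded and only the parses that fit entirely in the remaining tokens on either side are kept.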
Many productions of this grammar are empty productions. For example:

null null

As empty productions cause problems in the Chart Parsing algorithm, all empty productions are eliminated and new production rules are formed. For example:

<Sentence> <Sentence> <Sentence> <Sentence>

3.3 Transformational Module

The Transformational Module takes the Parse Structure as input and generates the Urdu translation of the sentence. As discussed in Section 2, Urdu and English differ in sentence and phrase structure. Hence phrases must be re-arranged within the sentence and words must be re-arranged within a phrase. The form of a word must also be changed depending upon other words in the phrase or upon other phrases.

To solve these problems, the Parse Structure (Parse Tree) obtained from the Syntactical Module is traversed depth first. Every node has a structure attached to it which holds the "Collection of Urdu Translated Word Tokens" of the node and other optional attributes. These structures are called Syntactical Tokens. When a node has filled all attributes of its Syntactical Token, it pushes the token on the stack. When all children of a node have been traversed, the parent node pops the syntactical tokens of its children and synthesizes its own (the parent's) syntactical token using them. This
method is repeated until the whole tree is traversed. The sequence of Urdu translated words obtained at the root node of the tree is output as the translation of the English sentence into Urdu. This translation has the correct word order and word forms of Urdu.

An example of a Syntactical Token is:

Attribute: Description
Gender: Gender of the Head Noun of the phrase.
Number: Number of the Head Noun of the phrase.
Pronoun Person: If the head is a pronoun then its person is stored; for proper nouns 3 is used.
Head Indices: Token indices of the Head Noun(s). (Many head nouns will be present if the Noun Phrase has "and" for conjunction of nouns.)
Adjective Indices: Token indices of the Adjectives modifying the Head Noun.
Sense: Semantic category of the Head Noun(s).
Tokens: Collection of word tokens holding the Urdu translated words.

Above is the Syntactical Token of a Noun Phrase (NP). It has Gender, Number, Pronoun Person and Sense so that its parent node (e.g. Sentence) can select the proper form of the verb using these attributes. Head Indices and Adjective Indices allow the parent node to make any required change to the word form of the Head Noun(s) and Adjective(s) after the attribute Tokens has been filled and the syntactical token has been pushed on the stack.

For example, if a preposition follows the Noun Phrase in an Urdu sentence then the form of a Masculine Singular adjective should be changed to Masculine Plural. But whether or not a preposition will be present after this Noun Phrase cannot easily be determined when the Noun Phrase Syntactical Token is synthesized. Hence the adjective(s) are assigned forms depending upon the gender and number of the Head Noun(s), and the Syntactical Token is pushed on the stack. If the parent node finds a preposition after this noun, it changes the form of the adjective using the adjective indices. For example, the Noun Phrase for "cold water" will be

thanda(cold) pani(water)

But when this Noun Phrase becomes part of the Prepositional Phrase "vessel of cold water", the adjective thanda changes to thanday:
thanday(cold) pani(water) ka(of) bartan(vessel)

The above example also shows the re-arrangement of words in a Prepositional Phrase: in the English grammar rule the preposition precedes its noun phrase, whereas in Urdu the components are re-arranged so that the preposition follows the noun phrase.

As discussed in Sections 2.4 and 2.5, verb translation (tense translation) is not an easy task. The correct Urdu form of the verb to be used in a sentence depends upon the number and gender of the subject and object and upon the tense of the English sentence. A rule based system transforms the Urdu root word (of the verb) into the correct Urdu form. For example, reads changes its form depending upon the subject in the following sentences.

woh(he) parhta hay(reads)
woh(she) parhti hay(reads)

Translation of the past indefinite tense with a transitive (Urdu) verb raises many issues. For example, this type of sentence requires the Subject Preposition and Object Preposition (discussed in Section 3.1). The sentences "She writes a letter.", "She wrote a letter" and "She wrote a story" will be translated as
woh(she) khat(letter) likhti hay(writes)
The verb form is changed depending upon the subject.

us(she) ne khat(letter) likha(wrote).
The verb form is changed depending upon the object. The Subject Preposition ne is introduced and the pronoun woh is changed to us.

us(she) ne kahani(story) likhi(wrote).
The verb form is changed depending upon the object. The Subject Preposition ne is introduced and the pronoun woh is changed to us.

Similarly, an Object Preposition can be introduced depending on the semantics of the object.

3.4 Ambiguity Resolution

As discussed in Section 3.2, the chart parser gives all possible parses of the sentence. If no complete parse is obtained then an algorithm is used to translate on the basis of partial parses. But in many cases, more than one complete parse of the sentence is generated by the Syntactical Module, and the system must choose only one of these parses. To choose the best parse, the system uses some heuristics: it gives negative weights to some rules. For example, there are the rules

<MainVerb> -> MainVerb
<MainVerb> -> AuxiliaryVerb

The second rule has a negative weight because auxiliary verbs are rarely used as main verbs. Similarly, there is a rule that an Adjective (Pre Noun Phrase) can be used as a Noun Phrase. This rule also has a negative weight and can only be used when no other rule for Noun Phrase applies. When there are multiple parses, a parse that uses this rule is rejected because of its negative weight. Hence in Ambiguity Resolution, the negative weights (if any) of all parses are compared and the parse with the highest value (i.e. the fewest negative-weight rules used) is selected as giving the best translation.

4. Conclusion

In this paper, the structure and working of the English to Urdu Translation System are discussed. Many problems of the translation are solved, and the software implementation is successful. Most of the translated sentences have correct word order and form. The remaining problems are mainly due to the absence of a grammar rule (production) for the sentence in the grammar rule database.
The system can be enhanced by introducing a context sensitive system for preposition translation, since the translation of English prepositions into Urdu is context sensitive. This is planned for the next phase of research. Statistical techniques will also be introduced in the next phase to make the system more efficient and accurate. The system will be a starting point for giving Urdu readers in Pakistan and all over the world access to information present in English.

References

1. Allen, James. 1995. Natural Language Understanding.
2. Siddiqui, Abul Lais. 1971. Jamaul Qawaid (Comprehensive Grammar).
3. Hutchins, W. John. 1995. Machine Translation: A Brief History.