Natural Language Processing

Abstract

The goal of the Natural Language Processing (NLP) group is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person.

This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. It is ironic that natural language, the symbol system that is easiest for humans to learn and use, is hardest for a computer to master. The challenges we face stem from the highly ambiguous nature of natural language. As an English speaker you effortlessly understand a sentence like "Flying planes can be dangerous". Yet this sentence presents difficulties to a software program that lacks both your knowledge of the world and your experience with linguistic structures. Is the more plausible interpretation that the pilot is at risk, or that the danger is to people on the ground? Should "can" be analyzed as a verb or as a noun? Which of the many possible meanings of "plane" is relevant? Depending on context, "plane" could refer to, among other things, an airplane, a geometric object, or a woodworking tool. How much and what sort of context needs to be brought to bear on these questions in order to adequately disambiguate the sentence?

We address these problems using a mix of knowledge-engineered and statistical/machine-learning techniques to disambiguate and respond to natural language input. Our work has implications for applications such as text critiquing, information retrieval, question answering, summarization, gaming, and translation. The grammar checkers in Office for English, French, German, and Spanish are outgrowths of our research; Encarta uses our technology to retrieve answers to user questions; Intellishrink uses natural language technology to compress cellphone messages; Microsoft Product Support uses our machine translation software to translate the Microsoft Knowledge Base into other languages. As our work evolves, we expect it to enable any area where human users can benefit by communicating with their computers in a natural way.

Natural language processing

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

Contents

1 Tasks and limitations
2 Concrete problems
3 Subproblems
4 Statistical NLP
5 The major tasks in NLP
6 Evaluation of natural language processing
7 Organizations and conferences

Tasks and limitations

In theory natural language processing is a very attractive method of human-computer interaction. Early systems such as SHRDLU, working in restricted "blocks worlds" with restricted vocabularies, worked extremely well, leading researchers to an excessive optimism which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity.

Natural language understanding is sometimes referred to as an AI-complete problem, because natural language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of "understanding" is one of the major problems in natural language processing.

Concrete problems

Some examples of the problems faced by natural language understanding systems:

• The sentences "We gave the monkeys the bananas because they were hungry" and "We gave the monkeys the bananas because they were over-ripe" have the same surface grammatical structure. However, in one of them the word "they" refers to the monkeys, while in the other it refers to the bananas: the sentences cannot be understood properly without knowledge of the properties and behaviour of monkeys and bananas.

• A string of words may be interpreted in myriad ways. For example, the string "Time flies like an arrow" may be read as:
  o time moves quickly, just like an arrow does;
  o measure the speed of flying insects like you would measure that of an arrow, i.e. "(You should) time flies like you would an arrow";
  o measure the speed of flying insects like an arrow would, i.e. "Time flies in the same way that an arrow would (time them)";
  o measure the speed of flying insects that are like arrows, i.e. "Time those flies that are like arrows";
  o a type of flying insect, "time-flies", enjoys arrows (compare "Fruit flies like a banana").
  English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech (a small parsing sketch after this list makes the ambiguity concrete).

• English and several other languages do not specify which word an adjective applies to. For example, in the string "pretty little girls' school":
  o Does the school look little?
  o Do the girls look little?
  o Do the girls look pretty?
  o Does the school look pretty?
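To make the "Time flies like an arrow" ambiguity concrete, the sketch below feeds the sentence to NLTK's chart parser with a deliberately tiny toy grammar. The grammar rules and lexicon are invented for illustration only; they are not the grammar used by any system described in this article.

```python
# A minimal sketch of syntactic ambiguity, assuming the NLTK library is installed.
# The toy grammar below is invented for illustration; real grammars are far larger.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> N | N N | Det N
VP  -> V NP | V PP
PP  -> P NP
N   -> 'time' | 'flies' | 'arrow'
V   -> 'time' | 'flies' | 'like'
P   -> 'like'
Det -> 'an'
""")

parser = nltk.ChartParser(grammar)
sentence = "time flies like an arrow".split()

# Every tree printed here is a distinct structural reading of the same word string.
for tree in parser.parse(sentence):
    print(tree)
```

Even this five-word grammar yields more than one parse tree (the "insects enjoy arrows" reading and the "time moves quickly" reading), which is why choosing among analyses requires semantic and contextual knowledge.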

Subproblems

Speech segmentation
In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, in natural speech there are hardly any pauses between successive words; locating word boundaries usually must take into account grammatical and semantic constraints, as well as the context.

Text segmentation
Some written languages, such as Chinese, Japanese and Thai, do not mark word boundaries either, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.

Word sense disambiguation
Many words have more than one meaning; the system has to select the meaning which makes the most sense in context (a minimal disambiguation sketch follows this list of subproblems).

Syntactic ambiguity
The grammars of natural languages are ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information. Specific problem components of syntactic ambiguity include sentence boundary disambiguation.

Imperfect or irregular input
Foreign or regional accents and vocal impediments in speech; typing or grammatical errors and OCR errors in texts.

Speech acts and plans
Sentences often do not mean what they literally say; for instance, a good answer to "Can you pass the salt?" is to pass the salt. In most contexts "Yes" is not a good answer, although "No" is better and "I'm afraid that I can't see it" is better yet. Likewise, if a class was not offered last year, "The class was not offered last year" is a better answer to the question "How many students failed the class last year?" than "None" is.
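The word sense disambiguation sketch promised above uses NLTK's simplified Lesk algorithm, which picks the WordNet sense whose dictionary gloss overlaps most with the surrounding sentence. Lesk is just one of many disambiguation strategies, not the method this article prescribes, and the example sentence is invented for illustration.

```python
# A minimal word sense disambiguation sketch using NLTK's simplified Lesk algorithm.
# Assumes `nltk` is installed and the WordNet data has been fetched once with
# nltk.download('wordnet').
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money"
tokens = sentence.split()

# lesk() compares the sentence's words with each WordNet gloss for "bank"
# and returns the sense with the greatest overlap (or None if nothing matches).
sense = lesk(tokens, "bank", pos="n")
if sense is not None:
    print(sense.name(), "-", sense.definition())
```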

Statistical NLP

Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
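To make the reference to corpora and Markov models concrete, the sketch below estimates a first-order (bigram) Markov model from a tiny invented corpus and uses it to score two word strings. The corpus and the crude smoothing constant are purely illustrative assumptions, not data from any real system.

```python
# A minimal first-order Markov (bigram) language model.  The "corpus" is invented
# for illustration; real statistical NLP systems train on millions of sentences.
from collections import Counter, defaultdict

corpus = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "the arrow flies fast",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()

for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1
        context_counts[prev] += 1

def bigram_probability(sentence, unseen=1e-6):
    """Probability of a sentence under the bigram model, with a crude floor for unseen pairs."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        count, total = bigram_counts[prev][curr], context_counts[prev]
        prob *= (count / total) if count else unseen
    return prob

# A fluent word order scores far higher than a scrambled one.
print(bigram_probability("time flies like an arrow"))
print(bigram_probability("arrow like flies time an"))
```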

The major tasks in NLP

• Speech recognition
• Natural language generation
• Machine translation
• Question answering
• Information retrieval
• Information extraction
• Text simplification
• Text proofing
• Translation technology
• Automatic summarization
• Foreign language reading aid
• Foreign language writing aid

Speech recognition

Speech recognition (in many contexts also known as "automatic speech recognition", computer speech recognition, or erroneously as voice recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).

Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.
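As a concrete illustration of the speech-to-text conversion described above, the sketch below uses the third-party SpeechRecognition package with its Google web backend. Both the package choice and the file name "voicemail.wav" are assumptions made for illustration; they are not part of the systems discussed in this article.

```python
# A minimal speech recognition sketch, assuming the third-party SpeechRecognition
# package is installed (pip install SpeechRecognition).  "voicemail.wav" is a
# hypothetical file name used only for illustration.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("voicemail.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Send the audio to Google's free web recognizer and print the word sequence.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print(f"Recognition service unavailable: {err}")
```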

Natural language generation

Natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Some people view NLG as the opposite of natural language understanding. The difference can be put this way: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to decide how to put a concept into words.
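The decision about "how to put a concept into words" can be illustrated with a deliberately simple template-based generator. The structured weather record and the templates below are invented for illustration; real NLG systems plan content, sentence structure and referring expressions far more carefully.

```python
# A minimal template-based natural language generation sketch.  The structured
# "knowledge base" record and the templates are invented for illustration.
record = {"city": "Seattle", "condition": "rain", "high_c": 12, "low_c": 7}

def generate_forecast(rec):
    # Wording decisions are reduced to a single template choice here.
    if rec["condition"] == "rain":
        template = "Expect rain in {city}, with a high of {high_c} °C and a low of {low_c} °C."
    else:
        template = "It will be {condition} in {city}, between {low_c} and {high_c} °C."
    return template.format(**rec)

print(generate_forecast(record))
```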

Machine translation

Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its most basic level, MT performs simple substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies.

Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than does translation of conversation or less standardised text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is". However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
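The "simple substitution of atomic words" mentioned above can be sketched as a dictionary lookup. The tiny English-Spanish lexicon is invented for illustration, and the output shows exactly why word-for-word substitution alone mishandles word order, agreement and idiom.

```python
# A minimal word-for-word machine translation sketch (English -> Spanish).
# The lexicon is a tiny invented sample; the point is that naive substitution
# ignores word order, agreement and idioms.
lexicon = {
    "the": "el", "cat": "gato", "black": "negro",
    "drinks": "bebe", "milk": "leche",
}

def translate_word_for_word(sentence):
    # Unknown words are passed through unchanged, as many MT systems do with names.
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())

print(translate_word_for_word("The black cat drinks milk"))
# -> "el negro gato bebe leche": the word order is wrong, since Spanish prefers "gato negro".
```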

Question answering

Question answering (QA) is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection), the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines (a minimal retrieval-style sketch appears at the end of this section).

QA research attempts to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.

• Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies.

• Open-domain question answering deals with questions about nearly everything, and can rely only on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.

(Alternatively, closed-domain might refer to a situation where only a limited type of question is accepted, such as questions asking for descriptive rather than procedural information.)
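The retrieval-style sketch referenced above answers a question by returning the sentence from a small collection that shares the most content words with it. The documents, the stopword list and the scoring rule are invented for illustration; real QA systems use far richer analysis than bag-of-words overlap.

```python
# A minimal retrieval-style question answering sketch: pick the sentence from a
# small invented collection that shares the most content words with the question.
import re

documents = [
    "The Text Retrieval Conference was cosponsored by NIST in 1992.",
    "SHRDLU worked in a restricted blocks world with a small vocabulary.",
    "Statistical NLP uses corpora and Markov models for disambiguation.",
]

STOPWORDS = {"the", "a", "an", "in", "of", "was", "is", "by", "for", "with", "what", "who", "when"}

def tokenize(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def answer(question):
    q_tokens = tokenize(question)
    # Score every candidate sentence by word overlap with the question.
    return max(documents, key=lambda doc: len(q_tokens & tokenize(doc)))

print(answer("When was the Text Retrieval Conference cosponsored?"))
```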

Information retrieval

Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet, the World Wide Web or intranets, for text, sound, images or data. There is a common confusion, however, between data retrieval, document retrieval, information retrieval, and text retrieval, and each of these has its own bodies of literature, theory, praxis and technologies. IR is, like most nascent fields, interdisciplinary, drawing on computer science, mathematics, library science, information science, cognitive psychology, linguistics, statistics and physics.

Automated IR systems are used to reduce information overload. Many universities and public libraries use IR systems to provide access to books, journals, and other documents. IR systems revolve around the notions of object and query. Queries are formal statements of information needs that are put to an IR system by the user. An object is an entity which keeps or stores information in a database. User queries are matched to objects stored in the database; a document is, therefore, a data object (a minimal index sketch appears at the end of this section). Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates.

In 1992 the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. Its aim was to support the information retrieval community by supplying the infrastructure needed for large-scale evaluation of text retrieval methodologies. Web search engines such as Google, Live.com and Yahoo Search are the most visible IR applications.
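The matching of user queries against stored document objects described above can be sketched with an inverted index. The document collection is invented for illustration, and the ranking (raw term overlap) is much cruder than the weighting schemes, such as TF-IDF, used in production IR systems.

```python
# A minimal inverted-index sketch: user queries are matched against stored
# document objects via a term -> document-id index.  The collection is invented
# for illustration.
from collections import defaultdict

documents = {
    1: "information retrieval reduces information overload",
    2: "libraries use retrieval systems for books and journals",
    3: "document surrogates represent documents in the system",
}

# Build the inverted index: each term points to the ids of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Rank document ids by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("retrieval systems"))  # -> document ids ordered by term matches
```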

Information extraction

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents. It is a sub-discipline of language engineering, a branch of computer science. It aims to apply methods and technologies from practical computer science, such as compiler construction and artificial intelligence, to the problem of processing unstructured textual data automatically, with the objective of extracting structured knowledge in some domain. A typical example is the extraction of information on corporate merger events, whereby instances of the relation MERGE(company1, company2, date) are extracted from online news ("Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."); a minimal pattern-based sketch of this example appears after the subtask list below. The significance of information extraction is determined by the growing amount of information available in unstructured (i.e. without metadata) form, for instance on the Internet. This knowledge can be made more accessible by transforming it into relational form. Natural language texts may need some form of text simplification to produce text that is more easily machine-readable before sentences can be extracted.

Typical subtasks of IE are:
• Named entity recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
• Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
• Terminology extraction: finding the relevant terms for a given corpus.
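The pattern-based sketch referenced above extracts a crude MERGE record from the example news sentence with a single hand-written regular expression. Both the pattern and the output structure are invented for illustration; real IE systems combine many patterns with statistical models and coreference resolution.

```python
# A minimal pattern-based information extraction sketch for the
# MERGE(company1, company2, date) example.  The pattern is invented for
# illustration and only covers this one phrasing of an acquisition.
import re

text = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."

pattern = re.compile(
    r"(?P<company1>[A-Z][\w-]*(?:\s[A-Z][\w.]*)*)\s+announced their acquisition of\s+"
    r"(?P<company2>[A-Z][\w-]*(?:\s[A-Z][\w.]*)*)"
)

match = pattern.search(text)
if match:
    # A crude structured record extracted from unstructured text.
    print({"relation": "MERGE",
           "company1": match.group("company1"),
           "company2": match.group("company2"),
           "date": "yesterday"})
```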

Text simplification

In natural language processing, text simplification is an important task because much of the English language consists of complex, compound sentences that are not easily processed for information tasks.

Text proofing (proofreading)

Proofreading traditionally means reading a proof copy of a text in order to detect and correct any errors. Modern proofreading often requires reading copy at earlier stages as well.

Translation technology

Translation is the interpretation of the meaning of a text in one language (the "source text") and the production, in another language, of an equivalent text (the "target text", or "translation") that communicates the same message. Translation must take into account a number of constraints, including context, the rules of grammar of the two languages, their writing conventions, their idioms and the like. Consequently, as has been recognized at least since the time of the translator Martin Luther, one translates best into the language that one knows best.

Traditionally translation has been a human activity, though attempts have been made to computerize or otherwise automate the translation of natural-language texts (machine translation) or to use computers as an aid to translation (computer-assisted translation). Perhaps the most common misconception about translation is that there exists a simple "word-for-word" relation between any two languages, and that translation is therefore a straightforward and mechanical process. On the contrary, translation is always fraught with uncertainties and with the potential for inadvertent "spilling over" of idioms and usages from one language into the other.

Automatic summarization

Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text. The phenomenon of information overload has meant that access to coherent and correctly developed summaries is vital, and as access to data has increased, so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google. Technologies that can make a coherent summary of any kind of text need to take into account several variables such as length, writing style and syntax in order to produce a useful summary.
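A simple way to make the idea concrete is frequency-based extractive summarization: score each sentence by how often its content words occur in the whole text and keep the top-scoring sentences. The input text and stopword list below are invented for illustration; production summarizers also weigh length, writing style and syntax as described above.

```python
# A minimal frequency-based extractive summarization sketch.  The input text
# is invented for illustration.
import re
from collections import Counter

text = ("Automatic summarization shortens a text by computer. "
        "The summary keeps the most important points of the text. "
        "Information overload makes good summaries vital. "
        "Search engines display short summaries of web pages.")

STOPWORDS = {"a", "an", "the", "of", "by", "most", "makes", "keeps"}

sentences = re.split(r"(?<=[.!?])\s+", text)
word_freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)

def score(sentence):
    # A sentence is rated by the corpus frequency of the words it contains.
    return sum(word_freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

# Keep the two highest-scoring sentences, restored to their original order.
top = sorted(sorted(sentences, key=score, reverse=True)[:2], key=sentences.index)
print(" ".join(top))
```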

Foreign language writing aid

A foreign language writing aid is a computer program that assists a non-native language user in writing well in their target language. Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks. Assisted aspects of writing include: lexical syntax (especially the "syntactic and semantic roles of a word's frame"), lexical semantics (context/collocation-influenced word choice and user-intention-driven synonym choice), idiomatic expression transfer, etc. Online dictionaries can also be considered a type of foreign language writing aid.

Evaluation of natural language processing

This section is a stub and needs to be expanded.

• History of evaluation in NLP
• Intrinsic evaluation
• Extrinsic evaluation
• Automatic evaluation
• Manual evaluation

Organizations and conferences

• Association for Computational Linguistics
• Association for Machine Translation in the Americas
• AFNLP - Asian Federation of Natural Language Processing Associations
