Multiple Choice Question Answering
Meshael Sultan
MSc Advanced Computer Science
Supervised by Dr. Louise Guthrie
Department of Computer Science, University of Sheffield
August 2006
This report is submitted in partial fulfilment of the requirement for the degree of MSc in Advanced Computer Science.
To the loving memory of my parents, Aisha and Mohammed. May God have mercy on their souls.
Signed Declaration
All sentences or passages quoted in this dissertation from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations which are not the work of the author of this dissertation have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this dissertation and the degree examination as a whole. Name: Meshael Sultan Signature: Date: 30th August 2006
Abstract
People rely on information to perform the activities of their daily lives, and much of this information can be found quickly and accurately online through the World Wide Web. As the amount of online information grows, better techniques for accessing it are necessary. Techniques for answering specific questions are in high demand, and large research programmes investigating methods for question answering have been developed to respond to this need. Question Answering (QA) technology, however, faces problems that inhibit its advancement. Typical approaches first generate many candidate answers for each question and then attempt to select the correct answer from this set. The techniques for selecting the correct answer are still in their infancy, and further methods are needed to choose the correct answer from the candidates. This project focuses on multiple choice questions and the development of techniques for automatically finding the correct answer. In addition to being a novel and interesting problem in its own right, the investigation has identified methods that web based QA technology could use to select the correct answer from potential candidate answers. The project investigated techniques performed both manually and automatically. The data consists of 600 questions collected from an online web resource, classified into 6 categories according to the questions' domain and divided equally between the investigation and the evaluation stages. The manual experiments were promising: 45 percent of the answers were correct, which increased to 95 percent after the form of the queries was restructured. For the automatic techniques, such as using quotation marks and replacing the question words according to the question type, the accuracy ranged between 48.5 and 49 percent, rising to 63 percent and 74 percent in some categories, such as geography and literature.
Acknowledgements
This project would not have existed without the will of Allah, the giver of all good and perfect gifts. His mercy and blessing have empowered me throughout my life. All praise is due to Allah for his guidance and grace. Peace and blessings be upon our prophet Mohammed. I would like to thank Dr. Louise Guthrie for her supervision, help, guidance and encouragement throughout this project. I would also like to thank Dr. Mark Greenwood for providing many good ideas and resources throughout the project, and Mr. David Guthrie for assisting me with the RASP program. Many thanks to my dear friend, Mrs. Basma Albuhairan, for her assistance in proofreading. My deepest thanks to my lovely friends Ebtehal, Arwa and Maha for their friendship, substantive support and encouragement. Finally, I would like to thank my compassionate brother Fahad, my sweet sister Ru’a and my loving aunts Halimah, Hana, Tarfa, and Zahra, who through their love supported me throughout my Master's degree.
Table of Contents
Signed Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Equations
List of Abbreviations
Chapter 1: Introduction
    1.1 Background
    1.2 Project Aim
    1.3 Dissertation Structure
Chapter 2: Literature Review
    2.1 Question Answering System
        2.1.1 History of QA Systems
        2.1.2 QA System Types
            2.1.2.1 Answers Drawn From Database
            2.1.2.2 Answers Drawn From Free Text
    2.2 The TREC QA Track
    2.3 Evaluation Measures
        2.3.1 Question Answering Measures
Chapter 3: Data and Resources Used in Experiments
    3.1 Data Set
        3.1.1 Question Set
        3.1.2 Knowledge Source
    3.2 Evaluation
    3.3 Tools
Chapter 4: Design and Implementation
    4.1 Manual Experiments
    4.2 Automated Experiments
Chapter 5: Results and Discussion
    5.1 Experiments Results
        5.1.1 Manual Experiments Results
        5.1.2 Automated Experiments Results
    5.2 Future Work
Chapter 6: Conclusions
Bibliography
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J
Appendix K
List of Tables
Table 2.1 Trigger words and their corresponding thematic roles and attributes
Table 2.2 Question categories and their possible answer types
Table 2.3 Sample question series from the test set. Series 8 has an organization as a target, series 10 has a thing as a target, series 27 has a person as a target
Table 4.1 Number of Google hits for the first manual experiment when a given question and a given answer are used as a query in Google
Table 4.2 Number of Google hits for the second manual experiment when a given question and a given answer are used as a query in Google
Table 5.1 Number of correct answers for the First and the Second Manual Experiments, and the overall accuracy for both
Table 5.2 Number of correct answers for every question category and its accuracy, in addition to the overall accuracy, in the First Automated Experiment
Table 5.3 Number of correct answers for every question category and its accuracy, in addition to the overall accuracy, in the Second Automated Experiment
Table 5.4 Number of correct answers for every question category and its accuracy using quotation, in addition to the overall accuracy, in the Third Automated Experiment
Table 5.5 Number of correct answers for every question type, its accuracy with and without using quotation, and the overall accuracy in the Fourth Automated Experiment
Table 5.6 Number of correct answers for every question type in each category, their accuracy with and without using quotation, and the overall accuracy
Table 6.1 Evaluation of the participant systems in the TREC 2005 QA track
List of Figures
Figure 2.1 MULDER mock-up
Figure 5.1 The improvement of the accuracy for the First and the Second Manual Experiments
Figure 5.2 The accuracy for each category in the First Automated Experiment
Figure 5.3 The accuracy for each category in the Second Automated Experiment
Figure 5.4 The accuracy for each category using quotation in the Third Automated Experiment
Figure 5.5 The accuracy for each category with and without quotation
Figure 5.6 The accuracy for each question type with and without quotation
Figure 5.7 The accuracy for every question type in each category with and without using quotation
List of Equations
Equation 2.1 Mean Reciprocal Rank
Equation 2.2 Answer Recall Lenient
Equation 2.3 Answer Recall Strict
Equation 2.4 Accuracy
List of Abbreviations
AI     Artificial Intelligence
ARA    Answer Recall Average
ARL    Answer Recall Lenient
ARS    Answer Recall Strict
IR     Information Retrieval
MRR    Mean Reciprocal Rank
NASA   National Aeronautics and Space Administration
NIST   National Institute of Standards and Technology
PoS    Part-of-Speech
QA     Question Answering
RASP   Robust Accurate Statistical Parsing
START  SynTactic Analysis using Reversible Transformations
TREC   Text REtrieval Conference
WWW    World Wide Web
Chapter 1: Introduction
1.1 Background

Searching the Web is part of our daily lives, and one of its uses is to determine the answer to a specific question, such as "When was Shakespeare born?" Most people, however, would not choose to spend a lot of time finding the answer to such a simple question. The tremendous amount of online information is increasing daily, making the search for such a specific answer even more difficult. Despite the drastic advances in search engines, they still fall short of understanding the user's specific question. As a result, more sophisticated search tools are needed to reliably provide the correct answers to these questions. The desirability of such technology has led to a relatively recent increase in funding for the development of Question Answering (QA) systems (Korfhage, 1997; Levene, 2006). In this dissertation a QA system is defined as an artificial intelligence system that takes a question as its input and attempts to provide the correct answer for that specific question. Most QA systems utilize sophisticated techniques from the natural language processing community. Since 1999, there has been a huge advancement in the performance of QA systems, but they are still in their early stages. All QA systems face difficulties in selecting the actual right answer from a list of possible answers, owing to the challenges and limitations of machine understanding of natural language. According to the Text REtrieval Conference (TREC), which arranges annual question answering evaluations to encourage research and improvement of these systems, the best QA system was only able to answer 70 percent of the factoid questions (questions whose answers represent simple facts) correctly. This project is mainly inspired by the difficulties still present in QA systems. The project investigates techniques for selecting accurate answers for multiple choice questions. This procedure is similar to the answer extraction stage in traditional QA systems, since a typical system first generates a small set of candidate answers before selecting one of them. This work investigates techniques for choosing the correct answer from among a small set of possible candidates, and the results may therefore be useful for improving current QA systems.
1.2 Project Aim

The aim of this project is to investigate automated techniques for answering multiple choice questions. Neither commercial nor research QA systems, which attempt to answer different types of questions and retrieve precise answers, are perfect. Typically, the system analyses the question, finds a set of candidate answers by consulting a knowledge resource and selects among the candidate answers before presenting the chosen answer to the user. One area where future research is needed is the most appropriate method for selecting the correct answer from a set of candidate answers. This project therefore focuses on techniques for improving this stage of QA systems. In addition, the idea of automatically finding the correct answers for multiple choice questions is both novel and interesting, and is yet to be fully explored.
1.3 Dissertation Structure

This dissertation is organised into 6 main chapters. Chapter 1 provides background information about the project and states its objectives and aims. Chapter 2 contains the literature review, which explains why QA systems are important; it also provides a brief history of QA systems, their types, and the relationship between QA and the TREC QA track, together with details of the evaluation measures for QA systems and their metrics. Chapter 3 describes the data and resources for this project, detailing the data set for the investigation experiments and the tools used within the project. Chapter 4 describes the design and implementation phases, including the experimental design and the implementation. Chapter 5 provides the results, analysis and evaluation of these experiments, stating the project's findings, the achieved objectives and further work needed for improvement. Chapter 6 draws conclusions from the results given.
Chapter 2: Literature Review
The world today is hungry for information. Organizations and individuals seek new or stored information every second in order to improve the performance of their work and daily activities. People seek information to solve problems, whether it is booking a seat at the cinema or finding a solution to a financial problem. They rely on information resources such as their own background knowledge, experts, libraries, private or public information, and the global networks, which have rapidly become the most important resource and are considered the cheapest way of easily accessing information from nearly everywhere. Chowdhury (1999) noted that as electronic information has become readily accessible, the rate at which information is generated has also increased. This information explosion has caused a major problem, which has raised the need for effective and efficient information retrieval. Even though Information Retrieval (IR) systems have emerged and improved over the last decade, users often find the search process difficult and tedious, as a simple question may require an extensive search that is time consuming and exhausting. In order to meet the need for an efficient search tool that retrieves the correct answer for a question, the development of accurate QA systems has become an active area of research. Such systems aim to retrieve answers in the shortest possible time with minimum expense and maximum efficiency. This chapter defines what a QA system is and provides a brief history of QA systems and their main types, illustrated with some important examples. It also explains how the TREC QA track is involved in promoting QA research, and describes the metrics needed to evaluate QA systems.
2.1 Question Answering System

The recent expansion of the World Wide Web has created demand for high performance Information Retrieval (IR) systems. IR systems help users to seek whatever they want on the Web and to retrieve the required information. A QA system is a type of IR system. An IR system provides the user with a collection of documents containing the search result, and the user extracts the correct answer. Unlike an IR system, a QA system provides users with the right answer to their question using underlying data sources, such as the Web or a local collection of
documents. QA systems therefore require complex natural language processing techniques to enable the system to understand the user's questions, the long term goal being that both the explicit and implicit meaning can be extracted by such a system (Korfhage, 1997; Levene, 2006; Liddy, 2001). Salton (1968) implies that a QA system ideally provides a direct answer in response to a search query by using stored data covering a restricted subject area (a database). Thus, if the user asks an efficient QA system "What is the boiling point of water?" it would reply "one hundred degrees centigrade" or "100 °C". Moreover, Salton (1968) stated that these systems have become important for different types of applications, varying from standard business-management systems to sophisticated military or library systems. Such applications can help personnel managers, bank account managers and military officers to easily access data about employees, customer accounts and tactical situations. "The task of QA system consists in analysing the user query, comparing the analyzed query with the stored knowledge and assembling a suitable response from the apparently relevant facts" (Salton and McGill, 1983:9).
2.1.1 History of QA Systems

The actual beginning of QA systems was in the 1960s within the Artificial Intelligence (AI) field, developed by researchers who employed knowledge based resources such as encyclopaedias, edited lists of frequently asked questions, and sets of newspaper articles. These researchers attempted to develop rules that enabled a logical expression to be derived automatically from natural sentences, and vice versa, and this led them towards QA system design (Belkin and Vickery, 1985; Etzioni et al., 2001; Joho, 1999). QA systems were intended to simulate human language behaviour and therefore be able to answer natural language questions: the machine attempts to understand the question and respond with an answer. In other words, understanding natural language by the machine is an essential process in the development of QA systems (Belkin and Vickery, 1985; Etzioni et al., 2001; Joho, 1999). The process of understanding natural language, however, has many complications because of its varied and complex nature. Accordingly, these complications led to the development of QA within restricted domains of knowledge, such as databases.
Nevertheless, the advance in natural language processing techniques was an important reason for the recent evolution of QA systems used in the information retrieval domain (Belkin and Vickery, 1985; Etzioni et al., 2001; Joho, 1999).
2.1.2 QA System Types

QA systems are classified into two types according to the source of the question's answer, that is, whether it comes from a structured source or from free text. The former systems return answers drawn from a database, and these are considered to be the earliest QA systems. As Belkin and Vickery (1985) noted, this approach has some disadvantages, as it is often restricted to a specific domain expressed in simple English, and the input questions are often limited to simple forms. The latter type generates answers drawn from free text instead of from a well-structured database. According to Belkin and Vickery (1985), these systems are not linked to a specific domain and the text can cover any subject. The following two sections describe these two types of systems in more detail.
2.1.2.1 Answers Drawn From Database

The earliest QA systems are often described as natural language interfaces to a database. They allow the user to access the information stored in a traditional database by using natural language requests or questions. The request is then translated into a database query language, e.g. SQL. For instance, if such an interface is connected to the personnel data of a company, the user might ask the system the following (Monz, 2003) (the user's query is shown in italics in the original and the system's response in bold):

> Who is the youngest female in the marketing department?
Mari Clayton.
> How much is her salary?
£2500.00
> Does any employee in the marketing department earn less than Mari Clayton?
No.

Two early examples of this type are 'Baseball' (Chomsky et al., 1961) and 'Lunar' (Woods, 1973), developed in the 1960s and early 1970s. The Baseball system answers
English questions about baseball (a popular sport in the United States of America). The database contains information about the teams, scores, dates and locations of the games, taken from lists summarizing a Major League season. This system allows users to communicate, in natural language, with an interface that understands the question and the database structure. The user could only ask a simple question without any connectives (such as 'and', 'or', etc.), without superlatives (such as 'smallest', 'youngest', etc.) and without a sequence of events (Chomsky et al., 1961). This system answered questions such as "How many games did the Red Sox play in June?", "Who did the Yankees lose to on August 10?" or "On how many days in June did eight teams play?" As Sargaison (2003) explained, this is done by analysing the question and matching normalised forms of the extracted key terms, such as "How many", "Red Sox" and "June", against a pre-structured database containing the baseball data, then returning an answer to the question. The Lunar system (Woods, 1973) is more sophisticated, providing information about chemical-analysis data on lunar rock and soil material collected during the Apollo moon missions, and enabling lunar geologists to compare and evaluate this information. This system answered 78 percent of the geologists' questions correctly. Gaizauskas and Hirschman (2001) believed that Lunar was capable of answering questions such as "What is the average concentration of aluminium in high alkali rocks?" or "How many breccias contain olivine?" by translating the question into a database-query language. An example of this (Monz, 2003) is as follows: the question "Does sample S10046 contain Olivine?" will be translated into the query "(TEST (CONTAIN S10046 OLIV))". Additionally, Monz (2003) stated that Lunar can manage a sequence of questions such as "Do any samples have aluminium greater than 12 percent?" "What are these samples?" Despite the significant success of these systems, they are restricted to a specific domain and are tied to a database storing the domain knowledge. The results of database based QA systems are therefore not directly comparable to open domain question answering over unstructured text.
2.1.2.2 Answers Drawn From Free Text

The next generation of QA systems evolved to extract answers from the plain machine-readable text found in a collection of documents or on the Web. Monz (2003) believed that the need for these systems is a result of the increasing growth of the Web and users' demand to access information in a simple and fast fashion, while their feasibility is due to advances in natural language processing techniques. As these QA systems are capable of answering questions about different domains and topics, they are more complex than database based QA systems, since they need to analyse the data in addition to the question in order to find the correct answer. The QUALM system was one of the early moves toward extracting answers from text. As Pazzani (1983) revealed, this QA system was able to answer questions about stories by consulting script-based knowledge constructed during story understanding. Ferret et al. (2001) added that it uses complex semantic information, due to its dependency on the question taxonomy provided by the question type classes. This approach was the inspiration for many QA systems (refer to Appendix A for the 13 conceptual question categories used in Wendy Lehnert's QUALM, taken from Burger et al. (2001)). The Wendlandt and Driscoll system (Driscoll and Wendlandt, 1991) is considered to be one of the early text based QA systems. This system uses National Aeronautics and Space Administration (NASA) plain text documentation to answer questions about the NASA Space Shuttle. Monz (2003) noted that it uses thematic roles and attributes occurring in the question to identify the paragraph that contains the answer. Table 2.1, taken from Monz (2003), illustrates examples of the trigger words and the corresponding thematic roles and attributes.

Trigger word    Corresponding thematic roles and attributes
Area            Location
Carry           Location
Dimensions      Size
In              destination, instrument, location, manner, purpose
Into            location, destination
On              location, time
Of              Amount
To              location, destination, purpose

Table 2.1 Trigger words and their corresponding thematic roles and attributes
Murax is another example of this type of QA system. It extracts answers for general fact questions by using an online version of Grolier's Academic American Encyclopaedia as a knowledge base. According to Greenwood (2005), it accesses the documents that contain the answer via information retrieval and then analyses them to extract the exact answer. As Monz (2003) explained, this is done by using question categories that focus on types of questions likely to have short answers. Table 2.2, taken from Monz (2003), illustrates these question categories and their possible answer types:

Question type    Answer type
Who / Whose      Person
What / Which     Thing, Person, Location
Where            Location
When             Time
How many         Number

Table 2.2 Question categories and their possible answer types
More recently, systems have been developed that attempt to find the answer to a question from very large document collections; these systems attempt to simulate finding answers on the Web. Katz (2001) states that START (SynTactic Analysis using Reversible Transformations) was one of the first systems that tried to answer questions using a web interface, and it has been available to users on the World Wide Web (WWW) since 1993. According to Katz (2001), this system focuses on answering questions about many subjects, such as geography, movies, corporations and weather, which requires time and effort in expanding its knowledge base. MULDER, on the other hand, was considered the first QA system to utilize the full Web, using multiple search engine queries and natural language processing to extract the answer. Figure 2.1 is a mock-up of MULDER based on an example seen in Etzioni et al. (2001), as the original system is no longer available online:
[Figure 2.1 shows a mock-up of the MULDER web interface. For the question "Who was the first American in space?" the system reports that it is 70% confident the answer is Alan Shepard, and lists the possible answers in order of confidence — Alan Shepard (70%) and John Glenn (15%) — each accompanied by supporting text snippets retrieved from the Web.]

Figure 2.1 MULDER mock-up
According to (WWW2), the general text based QA system architecture contains three modules: question classification, document retrieval and answer extraction. The question classification module is responsible for analysing and processing a question to determine its type, and consequently the type of answer is also determined. The document retrieval module is used to search through a set of documents in order to return those that might contain the answer. Finally, the answer extraction module identifies the candidate answers in order to extract the correct one. It may apply a statistical approach, for example using the frequency of the candidate answers in the document collection. An external database could also be used to verify that the answer and its category are appropriate to the question classification; for instance, if the question starts with the word 'where', its answer should be a location, and this information is used in selecting the correct answer.
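To make this division of labour concrete, the three modules can be pictured as the following interfaces. This is a minimal sketch written for illustration only; the class and method names are assumptions and do not correspond to any particular system described in this review.

```java
import java.util.List;

// Determines the expected answer type of a question (e.g. 'where' -> LOCATION).
interface QuestionClassifier {
    AnswerType classify(String question);
}

// Returns documents that are likely to contain the answer.
interface DocumentRetriever {
    List<String> retrieve(String question);
}

// Picks the final answer from the candidate answers found in the documents.
interface AnswerExtractor {
    String extract(String question, AnswerType type, List<String> documents);
}

enum AnswerType { PERSON, LOCATION, TIME, NUMBER, THING }

// A text based QA pipeline simply chains the three modules together.
class SimpleQaPipeline {
    private final QuestionClassifier classifier;
    private final DocumentRetriever retriever;
    private final AnswerExtractor extractor;

    SimpleQaPipeline(QuestionClassifier c, DocumentRetriever r, AnswerExtractor e) {
        this.classifier = c;
        this.retriever = r;
        this.extractor = e;
    }

    String answer(String question) {
        AnswerType type = classifier.classify(question);
        List<String> docs = retriever.retrieve(question);
        return extractor.extract(question, type, docs);
    }
}
```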
This type of QA system is the focus of this project, which investigates techniques for selecting the correct answers to multiple choice questions from free text on the Web.
2.2 The TREC QA Track

TREC is co-sponsored by the National Institute of Standards and Technology (NIST) and the United States Department of Defence (WWW4, 2004). It started as part of the TIPSTER Text program in 1992 and aims to support research in the information retrieval community, thereby helping to transfer technology from research laboratories to commercial products as quickly as possible. In the beginning, QA systems were isolated from the outside world in research laboratories; thereafter, they started to appear in the TREC competition. (WWW3, 2005) states that when the QA track at TREC was started in 1999, "in each track the task was defined such that the systems were to retrieve small snippets of text that contained an answer for open-domain, closed-class questions (i.e., fact-based, short-answer questions that can be drawn from any domain)". For every track, therefore, the participants are given a question set to answer and a corpus of documents, a collection of text documents drawn from newswires and articles, in which the answer to each question is included in at least one document. The answers are evaluated against answer patterns, and an answer is considered correct if it matches the answer pattern. TREC assesses the answer accuracy of the competing QA systems to judge their performance. The first track, TREC-8 (1999), as Voorhees (2004) clarified, focused on answering factoid questions, which are fact-based questions such as (taken from (WWW5, 2005)) "Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?" The main task was to return a ranked list of five strings containing the answer. Sanka (2005) revealed that in TREC-9 (2000) the task was similar to TREC-8 but the question set was increased from 200 to 693 questions. These questions were developed by extracting them from a log of questions submitted to the Encarta system (Microsoft) and from an Excite query log, whereas the questions in TREC-8 were taken from a log of questions submitted to the FAQ Finder system. Moreover, both TREC-8 and TREC-9 used the Mean Reciprocal Rank (MRR) score to evaluate the answers. As Voorhees (2004) mentioned, in the TREC 2001 QA track, however, the question set included 500 questions obtained from the MSN Search and AskJeeves logs. It
also contained different tasks: the main task was similar to that of the previous tracks, while the other tasks were to return an unordered list of short answers and to test context handling through a series of questions. As a result, these different tasks had different evaluation metrics and reports. Furthermore, Voorhees (2004) stated that the TREC 2002 QA track had two tasks: the main task and the list task. In contrast to the previous years, participants had to return the exact answer for each question, as text snippets containing the answer were not allowed. Accordingly, a new scoring metric called the confidence-weighted score was used to evaluate the main task, and another evaluation metric was used for the other task. Voorhees (2004) also reported that the TREC 2003 QA track test set contained 413 questions taken from MSN and AOL Search logs. It included two tasks, the main task and the passage task, which use different types of test question: the main task has a set of list questions, definition questions and factoid questions, while the passage task has a set of factoid questions which the systems should answer by returning a text snippet. This track had different evaluation metrics because it had different tasks and different types of question. With regard to the TREC 2004 QA track, Sanka (2005) and Voorhees (2004) mention that the question set was similar to TREC 2003, i.e. drawn from MSN and AOL Search logs. It contained series of questions (of several types, including factoid questions, list questions and other questions similar to the definition questions in TREC 2003), each series asking about a specific target. This target might be a person, an organization, an event or a thing. Table 2.3 illustrates a sample of the question set (WWW6, 2005).
Series 8 — Target: Black Panthers
  8.1   Factoid   Who founded the Black Panthers organization?
  8.2   Factoid   When was it founded?
  8.3   Factoid   Where was it founded?
  8.4   List      Who have been members of the organization?
  8.5   Other     Other

Series 10 — Target: Prions
  10.1  Factoid   What are prions made of?
  10.2  Factoid   Who discovered prions?
  10.3  List      What diseases are prions associated with?
  10.4  List      What researchers have worked with prions?
  10.5  Other     Other

Series 27 — Target: Jennifer Capriati
  27.1  Factoid   What sport does Jennifer Capriati play?
  27.2  Factoid   Who is her coach?
  27.3  Factoid   Where does she live?
  27.4  Factoid   When was she born?
  27.5  Other     Other

Table 2.3 Sample question series from the test set. Series 8 has an organization as a target, series 10 has a thing as a target, series 27 has a person as a target.
TREC 2004 had a single task that was evaluated by the weighted average of the three components (factoid, list and other). Moreover, according to Voorhees (2005), the TREC 2005 QA track used the same type of question set as TREC 2004 (with 75 question series). In addition to the main task, it had other tasks similar to TREC 2004, such as returning a ranked list of the documents that contain the answers, and a relationship task requiring systems to return evidence for the relationship hypothesized in the question, or for the lack of such a relationship. The main task was evaluated by the weighted average of the three components (factoid, list and other), while the other tasks had different evaluation measures. Thus, the TREC QA track has been an effective factor in the development of QA systems. It has also helped to evolve the evaluation metrics in order to bring about more accurate judgement of QA systems.
2.3 Evaluation Measures
Evaluation is a vital stage, as it assesses the performance of the systems as well as the accuracy of the techniques, and it allows confident conclusions to be drawn about the efficiency of their performance. Joho (1999) stated that TREC started evaluating QA systems in 1999, as a result of QA system developers' need for an automated evaluation method that measures a system's improvement. Since then, it has run a QA track to evaluate open-domain QA systems (which provide an exact answer for a question by using a special corpus). This corpus is a large collection of newspaper articles in which, for each question, there is at least one document containing the answer (WWW3, 2005). Hence, a QA system consists of document-retrieval and answer-extraction modules. The evaluation measures are divided into two types (WWW2): information retrieval measures, used to evaluate the document retrieval module, and QA measures, which evaluate the complete QA system. This dissertation, however, only discusses the evaluation measures for QA. The other measures are not applicable to this project, since a complete QA system will not be created, merely a simple program that represents the answer extraction module of a standard QA system.
2.3.1 Question Answering Measures

These measures are used to evaluate the whole QA system. Some of them are also used by TREC, such as the Mean Reciprocal Rank (MRR). MRR is used for QA systems that retrieve a ranked list of five answers for each question; each question is scored as the reciprocal of the rank of its first correct answer. According to Greenwood (2005), MRR is given by Equation 2.1, where |Q| is the number of questions and r_i is the answer rank for question i:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i}

Equation 2.1 Mean Reciprocal Rank
This means that if the second answer in the list is the correct answer for a question, then the reciprocal rank for that question is 1/2. This rank is computed for every question, and the average score is then calculated over all of the questions. Consequently, the higher the value of the metric, the better the performance of the system. There are also other measures, specified by Joho (1999). For example, Mani's method applies three measures: Answer Recall Lenient (ARL), Answer Recall Strict (ARS) and Answer Recall Average (ARA). These are used to evaluate the summaries found by a question responding to a task requiring a more detailed judgment. ARL and ARS are defined in Equation 2.2 and Equation 2.3, where n_1 is the number of correct answers, n_2 is the number of partially correct answers and n_3 is the number of answered questions:

ARL = \frac{n_1 + 0.5\,n_2}{n_3}

Equation 2.2 Answer Recall Lenient

ARS = \frac{n_1}{n_3}

Equation 2.3 Answer Recall Strict
ARA is the average of ARL and ARS. This measure, however, is not used in this project, as it requires a summary of the answer, which is not available in this case, and it requires determining whether each answer is correct, partially correct or missing. In addition, Greenwood (2005) uses a simple accuracy metric as another evaluation measure, defined as the ratio of the number of correct answers to the number of questions. This is represented by Equation 2.4, where |Q| is the number of questions and |C| is the number of correct answers:

accuracy = \frac{|C|}{|Q|} \times 100

Equation 2.4 Accuracy
These metrics are used throughout the project, thus allowing the accuracy of the techniques to be determined.
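As an illustration of how these metrics could be computed in code, the following is a minimal Java sketch written for this report (it is not part of the project's QueryPost programs, and the variable names are assumptions):

```java
// Minimal sketch of the MRR and accuracy metrics described above.
// ranks[i] is the rank of the first correct answer for question i,
// or 0 if no correct answer was returned for that question.
class EvaluationMetrics {

    // Equation 2.1: mean of 1/rank over all questions.
    static double meanReciprocalRank(int[] ranks) {
        double sum = 0.0;
        for (int rank : ranks) {
            if (rank > 0) {
                sum += 1.0 / rank;   // unanswered questions contribute 0
            }
        }
        return sum / ranks.length;
    }

    // Equation 2.4: percentage of questions answered correctly.
    static double accuracy(int correctAnswers, int totalQuestions) {
        return (double) correctAnswers / totalQuestions * 100.0;
    }

    public static void main(String[] args) {
        int[] ranks = {1, 2, 0, 1};                    // example ranks for four questions
        System.out.println(meanReciprocalRank(ranks)); // (1 + 0.5 + 0 + 1) / 4 = 0.625
        System.out.println(accuracy(9, 20));           // 45.0, as in the first manual experiment
    }
}
```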
Chapter 3: Data and Resources Used in Experiments
This section discusses in detail the requirements of the investigation, such as the data set (question set and the knowledge source) and the tools and evaluation methods used in this project.
3.1 Data Set

The data set includes the question set and the corpus of knowledge, or knowledge source. The question set is a set of multiple choice questions, each with four choices. The knowledge source is the source that the techniques consult to find the correct answer amongst the four choices.
3.1.1 Question Set

Throughout the investigative stage the questions were collected from an online quiz resource, Fun Trivia.com (WWW1). This could be relied on, as it has more than 1000 multiple choice questions together with their correct solutions. These questions also cover different domains, such as entertainment, history, geography, sport, science/nature and literature (refer to Appendix B for example questions from each category). A set of 600 randomly chosen questions, divided equally between the 6 domains, and their answers were collected for this project; the answers were withheld during the evaluation stages. The questions are provided in HTML format and their correct answers in separate HTML files, which makes the comparison between the technique's answers and the correct answers more complicated. Hence, the 600 multiple choice questions were reformatted manually into text files, so that each question is followed by its correct answer. Below is an example:
qID= 1
Television was first introduced to the general public at the 1939 Worlds Fair in
a. New York
b. Seattle
c. St. Louis
d. Montreal
answer: a

A tag was also added to mark the end of each question; when the techniques are automated, this allows the program to know where each question begins and ends.
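A minimal sketch of how such a file could be read back into memory is shown below. It is written purely for illustration; in particular, the end-of-question tag is assumed here to be the literal string "END", since the exact tag used in the project is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// One multiple choice question as stored in the reformatted text file.
class McQuestion {
    String id;
    String question;
    List<String> choices = new ArrayList<>();  // choices a-d in order
    char answer;                               // 'a', 'b', 'c' or 'd'
}

class QuestionFileReader {
    // Reads the question file, assuming the layout shown above and an
    // "END" line (an assumption) marking the end of each question.
    static List<McQuestion> read(String path) throws IOException {
        List<McQuestion> questions = new ArrayList<>();
        McQuestion current = new McQuestion();
        for (String line : Files.readAllLines(Paths.get(path))) {
            line = line.trim();
            if (line.startsWith("qID=")) {
                current.id = line.substring(4).trim();
            } else if (line.matches("[a-d]\\. .*")) {
                current.choices.add(line.substring(3).trim());
            } else if (line.startsWith("answer:")) {
                current.answer = line.substring(7).trim().charAt(0);
            } else if (line.equals("END")) {          // assumed end-of-question tag
                questions.add(current);
                current = new McQuestion();
            } else if (!line.isEmpty()) {
                current.question = line;              // the question text itself
            }
        }
        return questions;
    }
}
```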
3.1.2 Knowledge Source

The knowledge source was derived from the World Wide Web, as the investigated techniques involve the use of the Google Search Engine to decide which choice is the correct answer to a multiple choice question. Google is currently one of the most popular and largest search engines on the Internet. It was chosen for two main reasons (mentioned in Etzioni et al. (2001)): first, it has the widest coverage, having indexed more than one billion web pages; and secondly, it ranks pages so that those with the highest information value appear first.
3.2 Evaluation

The evaluation scheme is very important because it assesses the experiments according to their results. In this stage, each manually investigated technique is evaluated using the accuracy metric previously described in section 2.3.1, which helps in determining its performance in finding the correct answer amongst the other choices. The answers at this stage are also compared with the results in order to determine which technique is more appropriate. The results of this evaluation are discussed in detail in section 5.1.1. The techniques are then applied automatically over a large set of questions without prior knowledge of the answers. After comparing the answers with the results, as described in section 5.1.2, the precision and accuracy of each technique are determined through the aforementioned accuracy metric.
3.3 Tools

Having explored the requirements, from the data set through to the evaluation, the resources required to complete the project can be specified. An appropriate programming language had to be chosen, and Java was selected for its flexibility and familiarity. Additionally, another program is used for further experiments: Robust Accurate Statistical Parsing (RASP). RASP is a parsing toolkit resulting from cooperation between the Department of Informatics at Sussex University and the Computer Laboratory at Cambridge University. It was released in January 2002 and runs input text through several processes for tokenisation, tagging, lemmatization and parsing (Briscoe and Carroll, 2002; Briscoe and Carroll, 2003; WWW7).
Chapter 4: Design and Implementation
The aim of this project is to investigate techniques for finding the correct answer to multiple choice questions. This investigation involves the use of the Google Search Engine. This chapter explains the design of the manual experiments and the implementation of the automated techniques.
4.1 Manual Experiments

The initial experiments were performed manually and were later automated so that the precision and accuracy of the results could be assessed on a larger scale. In the initial investigations the question set sample is small (20 questions) and the questions are fact-related, pertaining to famous individuals and historical subjects. In the first experiment, the question is combined with each choice and passed to Google manually, and the choices are ranked by the number of search results returned. For instance, for the question:

The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the governor of this colony?
a. John Smith
b. Sir Walter Raleigh
c. John White
d. John White

the question is combined first with choice (a) and then sent to Google as follows:

The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the governor of this colony John Smith

This process is repeated with each choice. Subsequently, the choice with the highest number of search result hits is selected; if the numbers of hits are the same, the choice that appears first in the question is chosen. Table 4.1 illustrates the results of the first experiment. For question 7, the number of search result hits was equal for all four choices, so the first choice was selected as the possible answer. There were 9 correct answers in this experiment, giving an accuracy of 45 percent.
Question No.   Choice a   Choice b   Choice c   Choice d   Answer   Accuracy of the answer
1              65         138        168        254        d        Correct
2              567        524        735        517        c        Correct
3              9          38         51         39         c        Correct
4              13500      12600      11100      9830       a        Wrong
5              648        548        426        440        a        Wrong
6              64         75         78         74         c        Wrong
7              5          5          5          5          a        Wrong
8              31         33         34         27         c        Correct
9              221        217        203        271        d        Wrong
10             27200      10400      510        12400      a        Correct
11             63         57         46         39         a        Wrong
12             4          4          5          5          c        Correct
13             12         92         98         47         c        Correct
14             244        342        841        11400      d        Wrong
15             16         7          20         40         d        Wrong
16             609000     367000     301000     101000     a        Wrong
17             43         70         111        37         c        Wrong
18             39         43         42         42         b        Correct
19             24         11         70         69         c        Correct
20             133        40300      218        337        b        Wrong

Table 4.1 Number of Google hits for the first manual experiment when a given question and a given answer are used as a query in Google
After this experiment, this technique was implemented so that a larger number of questions could be posed to Google, giving a better estimate of the accuracy and precision of the results. The implementation used Java as the programming language and is described in detail in section 4.2. In the second experiment, the performance was increased by restructuring the form of the query. This was done by taking the same sample used in the first experiment and choosing the questions that had not been answered correctly (there were 11 of these). The questions were modified by using important keywords in the question and combining them with each choice as follows:
Question 4: In 1594, Shakespeare became an actor and playwright in what company?
a. The King's Men
b. Globe Theatre
c. The Stratford Company
d. Lord Chamberlain's Men

This question had a wrong answer in the first experiment; therefore the query form was restructured and combined with each choice as follows:

"In 1594" Shakespeare became an actor and playwright in "(one of the choices)"

This query was combined with choice (a) and sent to Google as follows:

"In 1594" Shakespeare became an actor and playwright in "The King's Men"

This was then repeated with each choice. In this example, keywords in the question were chosen to improve the accuracy of the answer, and the year was put in quotation marks, as "In 1594". It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer.
Question 5: What astronomer published his theory of a sun-centered universe in 1543?
a. Johannes Kepler
b. Galileo Galilei
c. Isaac Newton
d. Nicholas Copernicus

This question was restructured by replacing "What astronomer" with a quoted choice; the year was also put in quotation marks, "in 1543", as follows:

"(one of the choices)" published his theory of a sun-centered universe "in 1543"
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he defeated the boyars (influential families) in 1547. Who was he?
a. Nicholas II
b. Ivan IV (the Terrible)
c. Peter the Great
d. Alexander I
In question 6, "This Russian ruler" was replaced with a quoted choice and some important words such as "first", "Czar" and "defeated the boyars" were used. Moreover, the year was put in quotation marks; hence the query is as follows:

"(one of the choices)" was first Czar defeated the boyars "in 1547"
For the remaining questions, please refer to Appendix C. As a result of these changes, 19 of the 20 questions were answered correctly, as highlighted in Table 4.2; the changes therefore improved the accuracy of the results to 95 percent.
Question No.   Choice a   Choice b   Choice c   Choice d   Answer   Accuracy of the answer
1              65         138        168        254        d        Correct
2              567        524        735        517        c        Correct
3              9          38         51         39         c        Correct
4              348        301        6          461        d        Correct
5              230        309        201        465        d        Correct
6              65         171        109        70         b        Correct
7              10400      731        25500      603        c        Correct
8              31         33         34         27         c        Correct
9              225        49         216        137        a        Correct
10             27200      10400      510        12400      a        Correct
11             48400      72200      60000      37800      b        Correct
12             4          4          5          5          c        Correct
13             12         92         98         47         c        Correct
14             0          0          3          0          c        Correct
15             9          55         5          10         b        Correct
16             609000     367000     301000     101000     a        Wrong
17             0          12         0          0          b        Correct
18             39         43         42         42         b        Correct
19             24         11         70         69         c        Correct
20             2          99         4          307        d        Correct

Table 4.2 Number of Google hits for the second manual experiment when a given question and a given answer are used as a query in Google
During these manual experiments, some special and interesting cases about the structuring of the questions were found. For example, when the choice is the complete name (first name and surname) for a famous person it was found that it worked better to combine the query with the surname, as the first name is rarely used (as can be seen
in question 15 in Appendix C). Moreover, it was noticed that the answer's accuracy could be enhanced in several ways: by restructuring the form of the query using quotation marks (questions 4, 5, 6, 7, 9, 14, 17 and 20 in Appendix C); by selecting the main keywords in the query (questions 6, 7, 11 and 15 in Appendix C); and by replacing the question words (What, Who, Which and Where) and the noun following the question word with one of the choices (questions 4, 5, 7, 9, 11, 15 and 20 in Appendix C). These changes were made manually, and some of them were then applied automatically, as explained in section 4.2.
4.2 Automated Experiments

In the first automated experiment, the data set contained 180 questions encompassing different domains, divided into 30 questions for each category. In this stage, the multiple choice questions were collected in a text file and reformatted, as explained previously in section 3.1.1. A program named QueryPost was written in Java. It uses a Google class written by Dr. Mark Greenwood. This program combines the question with each choice and passes the result to Google to retrieve the number of search results; the program code is shown in Appendix D. The choice with the highest number of search result hits is then selected as the answer to the question. Two text files are generated. The first is the main file, containing the number of hits for the query and for the query combined with each choice; it also records the correct choice and the choice made by the technique (a sample of the output can be seen in Appendix E). The second file records each action the program undertakes and can be used as a reference when needed. In the second experiment, the data set was larger than in the previous experiment, containing 600 questions divided into 100 questions for each category. These questions were also sent to Google using the QueryPost program. After this experiment, another manual experiment was undertaken, as previously mentioned in section 4.1, to investigate different methods of finding the correct answer for multiple choice questions. Based on the observations from questions 4, 5, 6, 7, 9, 14, 17 and 20 in the second manual experiment, it was noticed that using quotation marks improves the accuracy of the answer. Therefore, a third experiment was undertaken that involved modifying the code of the QueryPost program according to these observations. The resulting new version of the program, named QueryPost2, is capable of combining the question with each choice using two techniques.
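The answer-selection step that QueryPost automates, and that the later versions reuse, can be sketched as follows. This is a simplified illustration only, not the project's actual code: the GoogleSearch.getHitCount method stands in for Dr. Greenwood's Google class, whose real interface is not shown in this chapter.

```java
import java.util.List;

// Stand-in for the Google class used by QueryPost; the interface is assumed.
interface GoogleSearch {
    long getHitCount(String query);   // number of search result hits for a query
}

class AnswerSelector {
    private final GoogleSearch google;

    AnswerSelector(GoogleSearch google) {
        this.google = google;
    }

    // Combines the question with each choice, queries Google, and returns the
    // index of the choice with the most hits. Ties are broken by keeping the
    // earlier choice, mirroring the rule used in the manual experiments.
    int selectAnswer(String question, List<String> choices) {
        int best = 0;
        long bestHits = -1;
        for (int i = 0; i < choices.size(); i++) {
            long hits = google.getHitCount(question + " " + choices.get(i));
            if (hits > bestHits) {        // strictly greater, so earlier choices win ties
                bestHits = hits;
                best = i;
            }
        }
        return best;   // 0 = choice a, 1 = choice b, ...
    }
}
```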
33
Chapter 4: Design and Implementation The answer with quotation marks and the answer without quotation marks can be seen in the program code in Appendix F. Moreover, 3 text files are generated. The two main files are for each technique that contains the number of search result hits for the query and for the query with each choice. These files also represent the correct choice and the technique’s choice. The third file records each action the program undertakes and this can be used as the reference source when needed. This program was also run using the same question set used in the second experiment. Additionally, the forth experiment relied on the perceptions from questions 4, 5, 7, 9, 11, 15 and 20 as can be seen in Appendix C. In the second manual experiment, it was noticed that replacing the question words (What, Who, Which and Where) and the nouns following the question words, by the answer, enhances the accuracy of the approach. Thus, to apply this method when using an automated technique a program was written categorising the questions according to the question words. The program is called QueryFilter as documented in the program code in Appendix G. This program reads the query from the Query text file that contains 600 multiple choice questions, and then extracts and filters the question and the sentence that begins with ‘What’, ‘Who’, ‘Where’, ‘Which’ and ‘This’. Thereafter, the program removes the words ‘Who’, ‘Where’, ‘Which’, ‘What’ and ‘This’ from the questions and writes them in 5 separate text files. The first file is for the ‘Who’ questions, the second file is for the ‘Where’ questions, the third file is for the ‘Which’ questions and the forth file is for the ‘What’/‘This’ questions (the sentences in the forth file are treated in the same manner and will require additional processing steps to remove the noun words following them). The fifth file is for other questions that are started with other question words. Then the first three text files and the fifth file are sent to another program called QueryPost3. This program is the same as the QueryPost2 program. However, the order of the question and the choices are modified. Firstly, each choice is combined to replace the question words with the new question by using two techniques, that is, the answer in quotation marks and the answer without quotation marks. This is then passed to Google to retrieve the search results as can be seen in the program code in Appendix H. Thus, the choice with the highest result is selected as the answer to the question. Nevertheless, in the ‘What’/’This’ scenarios, the text file contains ‘What’/’This’ queries in the RASP program, for tagging the words in the file. According to Briscoe and Carroll (2002), this program runs the input text file through certain processes. It first tokenizes the words in the input file to generate a sequence of tokens, which are separated from the punctuations by spaces, and marks the boundaries of the system. These tokens are then tagged with one of the 155 CLAWS-2 part-of-speech (PoS) and punctuation labels. Therefore, after this process the output is a text file which contains words tagged with PoS. This results in tagging all the noun words as can be seen in Appendix I. Some modifications, however, are made to the output file, as there are
different types of nouns with different tags, as can be seen in Appendix J. These tags are manually normalised to a single tag (_NN1), making them easier for the NounRemove program to recognise. The NounRemove program, listed in Appendix K, deletes the noun words following the question word and writes the remaining queries to a text file. This text file is then sent to QueryPost3, which combines each choice with the question using the two techniques (the choice in quotation marks and the choice without quotation marks) and finally passes the queries to Google to retrieve the search results.
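As an illustration only, the following is a minimal sketch of the noun-stripping idea behind the NounRemove step; it is not the project's actual program, which is listed in Appendix K. It assumes the input is a RASP-style line of word_TAG tokens in which all noun tags have already been normalised to _NN1, as described above.

public class NounStripSketch {

    // Removes the leading question word and any nouns that immediately follow
    // it, then returns the remaining words without their tags.
    static String stripLeadingNouns(String taggedQuestion) {
        String[] tokens = taggedQuestion.split("\\s+");
        StringBuilder out = new StringBuilder();
        int i = 1;                                  // skip the question word (What/This)
        while (i < tokens.length && tokens[i].endsWith("_NN1")) {
            i++;                                    // skip the noun(s) after the question word
        }
        for (; i < tokens.length; i++) {
            // keep the word, drop its part-of-speech tag
            out.append(tokens[i].split("_")[0]).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Example tagged question in the style of the RASP output in Appendix I.
        String tagged = "What_DDQ genre_NN1 is_VBZ the_AT 1993_MC movie_NN1 'Blue'_NP1";
        System.out.println(stripLeadingNouns(tagged));
    }
}

Each choice would then be prefixed to the stripped question and sent to Google by QueryPost3, with and without quotation marks.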
Chapter 5: Results and Discussion
In this chapter, the results of the experiments are presented. The results are divided into two sections: one for the manual results and one for the automated results. An analysis and discussion of the findings is then given, which in turn indicates how far the goals of the project were met. Finally, suggestions are given for further work that could improve on what the project has achieved.
5.1 Experiment Results

We investigated a number of experiments manually and then applied them automatically. The results of these experiments are therefore divided according to their type, as follows:
5.1.1 Manual Experiments Results

The manual experiments consisted of two experiments using a small set of multiple choice questions (20 questions). In the first experiment, the question was combined with its choices and sent to Google as a query. As a result, 9 questions out of 20 were answered correctly, giving an accuracy of 45 percent, as can be seen in Table 5.1 and Figure 5.1. Since this is a promising result, it was decided to automate this technique so that it could be applied to a larger set of questions. Moreover, to improve the technique further, another manual experiment was undertaken. In this experiment the query forms were restructured by using important keywords, using quotation marks and replacing the question words, as previously explained in detail in section 4.1. According to Table 5.1, 19 questions out of 20 were answered correctly, which increased the accuracy to 95 percent, as shown in Figure 5.1.

Exp                      First Manual Exp.    Second Manual Exp.
No of correct answers    9                    19
No of questions          20                   20
Accuracy                 45%                  95%
Table 5.1 Number of correct answers for the First and the Second Manual Experiments, and the overall accuracy for both of them
Figure 5.1 The improvement of the accuracy for the First and the Second Manual Experiments
5.1.2 Automated Experiments Results

In the automated experiments, the manual techniques were applied using the programs written for this purpose. In the first automated experiment, a small data set containing 180 questions from the different categories was used, as previously mentioned in section 4.2. These questions were read and sent to Google by the QueryPost program to retrieve the search results, and the program chose the answer from among the four choices for each question. The results for this experiment are shown in Table 5.2, which gives the number of correct answers and the accuracy for every category, together with the overall accuracy for the whole question set.
Question category    No of correct answers    No of questions    Accuracy
Entertainment        7                        30                 23.3%
Geography            17                       30                 56.7%
History              8                        30                 26.7%
Literature           18                       30                 60%
Science/Nature       11                       30                 36.7%
Sport                20                       30                 66.7%
Total                81                       180                45%

Table 5.2 Number of correct answers for every question category and its accuracy, in addition to the overall accuracy in the First Automated Experiment
Figure 5.2 The accuracy for each category in the First Automated Experiment
According to Figure 5.2, the highest accuracy was for the sport category, which achieved 66.7 percent. The entertainment category, however, was much lower, as it achieved only 23.3 percent. In addition, the accuracy was higher than 50 percent in both the literature and the geography categories, but decreased to 36.7 percent in the science/nature category, and a further decrease was evident in the history category. In the same manner, and as previously detailed in section 4.2, the second automated experiment was undertaken by running the QueryPost program on a larger set of
questions containing 600 multiple choice questions (for more accurate results). These results are shown in Table 5.3, which gives the number of correct answers and the accuracy for every category, in addition to the overall accuracy for the whole question set.

Question category    No of correct answers    No of questions    Accuracy
Entertainment        26                       100                26%
Geography            62                       100                62%
History              35                       100                35%
Literature           67                       100                67%
Science/Nature       40                       100                40%
Sport                45                       100                45%
Total                275                      600                45.8%

Table 5.3 Number of correct answers for every question category and its accuracy, in addition to the overall accuracy in the Second Automated Experiment
The overall accuracy is also shown in Table 5.3. It does not change much compared to the overall accuracy in the first automated experiment; hence the size of the question set does not greatly affect the accuracy.
Figure 5.3 The accuracy for each category in the Second Automated Experiment
From Figure 5.3 it can be seen that the results of the second experiment are similar to those of the first, as the three categories literature, geography and sport still have the highest accuracy. This time, however, the highest accuracy was in the literature category, which reached 67 percent, while the sport category dropped to 45 percent. At the same time, the accuracy for geography increased to 62 percent, but the lowest accuracy was still in the entertainment category at 26 percent, which was lower than the accuracy for history at 35 percent.

The third experiment was undertaken in light of the findings of the second manual experiment, where it was found that using quotation marks may improve the accuracy of the answers. This technique was therefore applied to the same question set used in the second automated experiment by means of the QueryPost2 program. The results of this experiment are summarised in Table 5.4, which gives the number of correct answers and the accuracy for every category using quotation marks, as well as the overall accuracy for this experiment.

Question category    No of correct answers using quotation    No of questions    Accuracy using quotation
Entertainment        30                                       100                30%
Geography            62                                       100                62%
History              40                                       100                40%
Literature           75                                       100                75%
Science/Nature       39                                       100                39%
Sport                48                                       100                48%
Total                294                                      600                49%

Table 5.4 Number of correct answers for every question category and its accuracy using quotation, in addition to the overall accuracy in the Third Automated Experiment
Figure 5.4 The accuracy for each category using quotation in the Third Automated Experiment
According to Figure 5.4, it can be concluded that literature still achieves the highest accuracy, at 75 percent, followed by the geography category at 62 percent. Sport follows with 48 percent, which is still high in comparison to history at 40 percent. Entertainment remains at the lowest level of accuracy, with 30 percent, while the accuracy for the science/nature category increased to 39 percent.
Figure 5.5 The accuracy for each category with and without quotation
Figure 5.5 summarises the results of the second and the third experiments, showing the variation in accuracy across all of the categories between the two techniques, that is, with and without quotation marks. It can be concluded that finding the correct answer is easier for the literature and geography questions than for the entertainment questions, which were more difficult. It was also found that using quotation marks improves the selection of the correct answer in all categories by at least 1 percent, with the exception of the geography category, whose percentage did not change even with quotation marks. Additionally, the overall accuracy using quotation marks was 49 percent, which is higher than the 45.8 percent obtained without quotation marks; that is, the quotation-mark technique performs better.

The fourth experiment depended on the findings of the second manual experiment: replacing the question words (Who, Where, Which, What/This) in the queries with the choices, in addition to using quotation marks, could improve the selection of the correct answer. Accordingly, the programs described in section 4.2 were used. The final results from the QueryPost3 program are shown in Table 5.5.
Question type    No of correct answers without quotation    No of correct answers using quotation    No of questions    Accuracy without quotation    Accuracy using quotation
Which            80                                         85                                       175                45.7%                         48.6%
Who              69                                         77                                       142                48.6%                         54.2%
This & What      82                                         87                                       198                41.4%                         43.9%
Where            8                                          9                                        21                 38.1%                         42.9%
Other            28                                         33                                       64                 43.8%                         51.6%
Total            267                                        291                                      600                44.5%                         48.5%

Table 5.5 Number of correct answers for every question type, together with its accuracy with and without using quotation and the overall accuracy in the Fourth Automated Experiment
Figure 5.6 The accuracy for each question type with and without quotation
As shown in Figure 5.6, the highest accuracy occurs for the 'Who' questions, both with and without quotation marks, at 54.2 percent and 48.6 percent respectively. This is because the answer to this type of question is a person's name (which can be found
more easily by quoting the whole name). The accuracy is slightly lower for the 'Which' questions: 48.6 percent with quotation marks and 45.7 percent without. The accuracy falls to its lowest rate for the 'Where' questions, at 42.9 percent with quotation marks and 38.1 percent without; this could be a result of the small number of questions of this type (21 questions). For the 'What'/'This' questions the accuracy rises again, to 41.4 percent without quotation marks and 43.9 percent with them. Thus, the best performance was for the 'Who' and 'Which' questions and the worst for the 'Where' questions. Furthermore, the results were analysed according to both the question categories and the question types, as can be seen in Table 5.6:
Question category    Question type    No of correct answers without quotation    No of correct answers with quotation    No of questions    Accuracy without quotation    Accuracy with quotation
Entertainment        Which            6     8     27    22%    30%
Entertainment        Who              4     6     18    22%    33%
Entertainment        This/What        14    11    51    27%    22%
Entertainment        Where            0     0     0     0%     0%
Entertainment        Other            2     3     4     50%    75%
Entertainment        Total            26    28    100   26%    28%
Geography            Which            24    25    40    60%    63%
Geography            Who              0     0     0     0%     0%
Geography            This/What        32    34    49    65%    69%
Geography            Where            2     2     4     50%    50%
Geography            Other            2     2     7     29%    29%
Geography            Total            60    63    100   60%    63%
History              Which            5     7     18    28%    39%
History              Who              20    17    42    48%    40%
History              This/What        7     9     30    23%    30%
History              Where            0     0     0     0%     0%
History              Other            2     6     10    20%    60%
History              Total            34    39    100   34%    39%
Literature           Which            7     9     13    54%    69%
Literature           Who              37    44    59    63%    75%
Literature           This/What        7     9     13    54%    69%
Literature           Where            1     1     1     100%   100%
Literature           Other            11    11    14    79%    79%
Literature           Total            63    74    100   63%    74%
Science/Nature       Which            10    8     24    42%    33%
Science/Nature       Who              2     4     7     29%    57%
Science/Nature       This/What        18    18    43    42%    42%
Science/Nature       Where            1     1     3     33%    33%
Science/Nature       Other            8     9     23    35%    39%
Science/Nature       Total            39    40    100   39%    40%
Sport                Which            28    28    53    53%    53%
Sport                Who              6     6     16    38%    38%
Sport                This/What        4     6     12    33%    50%
Sport                Where            4     5     13    31%    38%
Sport                Other            3     2     6     50%    33%
Sport                Total            45    47    100   45%    47%

Table 5.6 Number of correct answers for every question type in each category, together with their accuracy with and without using quotation and the overall accuracy
Figure 5.7 The accuracy for every question type in each category with and without using quotation
Figure 5.7 summarises all the techniques applied, that is, using quotation marks and replacing the question words with the choices, for every question category. The highest performance was in the literature and geography categories, which achieved 74 percent and 63 percent accuracy respectively using both of the aforementioned techniques. In contrast, the lowest performance was for the entertainment questions,
with 28 percent accuracy. This variation in accuracy between the question categories is a result of the type of information that is available on the WWW. Thus, according to these experiments, it can be concluded that the WWW contains more information about literature and geography and less about entertainment, which leaves these techniques unable to answer correctly a question such as:

Who won the Oscar for 2000 for Best Actor
a. Tom Hanks
b. Ed Harris
c. Geoffrey Rush
d. Russell Crowe
answer: d
5.2 Future Work

During the manual experiments, it was found that choosing query keywords such as the date or the year, and using only the surname instead of a person's whole name, may enhance the performance of the QA system in finding the correct answer. Applying these ideas automatically, however, would require more time and more knowledge of natural language processing.
Chapter 6: Conclusions
Knowledge and information, and their use, have expanded and spread rapidly in recent years, and information has become far more accessible through the World Wide Web. The increased use of the web, and the growth in the number of users with diverse backgrounds and objectives, require more accurate and effective search over the web. Consequently, many techniques have been developed in response to the need to provide reliable answers to users' searches. Question Answering (QA) technology is a very important field within Information Retrieval, since answering specific questions correctly is becoming a vital way to deal with the ever increasing volume of information available as free text. This technology, however, faces some obstacles in its development.

This project investigated different techniques for automatically finding the correct answer to multiple choice questions. The results support our hypothesis that the correct answer to a multiple choice question will obtain a higher hit rate in Google than the wrong answers when submitted along with the question. Manual experiments were initially carried out on a small set of questions (20 questions). These questions were sent to Google combined with each of their choices, the hit rate for each choice was recorded, and the choice with the highest hit rate was selected as the answer. As a result, 45 percent accuracy was achieved. When this method was automated and applied to a large data set (600 questions), categorised by question domain into 6 categories (entertainment, geography, history, literature, science/nature and sport), almost the same accuracy, 45.8 percent, was obtained, a slight increase. Accordingly, the manual experiments were revisited to investigate other techniques that could enhance the accuracy. It was found that modifying the query by placing quotation marks around each choice and replacing the question words (Who, Which, Where, What/This) with the choices improved the accuracy to 95 percent. These techniques were therefore applied automatically on a reasonable scale, by identifying the question types in each category and quoting the choices, to see what improvement could be made. It was found that the accuracy varied from 48.5 percent to 49 percent, and increased to 63 percent and 74 percent in some question categories, such as the geography and literature questions.
Although the overall accuracy of these techniques did not match that of some other QA systems, which have achieved more than 70 percent (Voorhees, 2004), it was better than most of the systems that participated in the TREC 2005 QA track. The investigated techniques achieved from 48.5 percent to 49 percent accuracy, which is higher than the median in that track, 15.2 percent, according to Table 6.1. This table, taken from Greenwood (2006), shows the accuracy of the participant systems in the TREC 2005 QA track.

System categories    Accuracy
Best systems         71.3%
Median system        15.2%
Worst systems        1.4%

Table 6.1 Evaluation of the participant systems in the TREC 2005 QA track
Moreover, the comparison between our techniques and these systems is not entirely fair, because the questions in our set were much longer than the questions in the TREC 2005 QA track data set. Those systems are the result of research projects with more human resources and larger funding, and a great deal of time was spent creating them. In contrast, our techniques are not time consuming, require less programming, and still perform better than most of the other systems. Overall, this project has shown that rather simple techniques can be applied to obtain respectable results. The project does not solve the problem completely, yet the results represent a notable improvement over the baseline in comparison with other systems.
Bibliography
Belkin, N. J. and Vickery, A. (1985) Interaction in Information Systems: a Review of Research from Document Retrieval to Knowledge-Based Systems. London: The British Library.

Breck, E. J., Burger, J. D., Ferro, L., Hirschman, L., House, D., Light, M. and Mani, I. (2000) How to Evaluate Your Question Answering System Every Day and Still Get Real Work Done. Proceedings of LREC-2000, Second International Conference on Language Resources and Evaluation. Greece: Athens. [online]. Available: http://www.cs.cornell.edu/~ebreck/publications/docs/breck2000.pdf [1/5/2006].

Briscoe, E. and Carroll, J. (2002) Robust Accurate Statistical Annotation of General Text. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002). Canary Islands: Las Palmas. pp. 1499-1504.

Briscoe, E. and Carroll, J. (2003) RASP System Demonstration. [online]. Available: http://www.informatics.sussex.ac.uk/research/nlp/rasp/offline-demo.html [20/7/2006].

Burger, J., Cardie, C., Chaudhri, V., Gaizauskas, R., Israel, D., Jacquemin, C., Lin, C.Y., Maiorano, S., Miller, G., Moldovan, D., Ogden, B., Prager, J., Riloff, E., Singhal, A., Shrihari, R., Strzalkowski, T., Voorhees, E. and Weischedel, R. (2001) Issues, Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A). [online]. Available: http://www-nlpir.nist.gov/projects/duc/papers/qa.Roadmap-paper_v2.doc [8/5/2006].

Chomsky, C., Green, B. F., Laughery, K. and Wolf, A. K. (1961) BASEBALL: An Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference, volume 19, pages 219-224.

Chowdhury, G. G. (1999) Introduction to Modern Information Retrieval. London: Library Association.

Driscoll, J. and Wendlandt, E. (1991) Incorporating a Semantic Analysis into a Document Retrieval Strategy. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 270-279.
Etzioni, O., Kwok, C. and Weld, D.S. (2001) Scaling Question Answering to the Web. Proc WWW10. Hong Kong. [online]. Available: http://www.iccs.inf.ed.ac.uk/~s0239548/qa-group/papers/kwok.2001.www.pdf [1/5/2006].

Ferret, O., Grau, B., Hurault-Plantet, M., Illouz, G. and Jacquemin, C. (2001) Document Selection Refinement Based on Linguistic Features for QALC, a Question Answering System. In Proceedings of RANLP-2001. Bulgaria: Tzigov Chark. [online]. Available: http://www.limsi.fr/Individu/mhp/QA/ranlp2001.pdf [2/8/2006].

Gabbay, I. (2004) Retrieving Definitions from Scientific Text in the Salmon Fish Domain by Lexical Pattern Matching. MA, University of Limerick. [online]. Available: http://etdindividuals.dlib.vt.edu:9090/archive/00000114/01/chapter2.pdf [2/8/2006].

Gaizauskas, R. and Hirschman, L. (2001) Natural Language Question Answering: The View from Here. Natural Language Engineering 7 (4). pp. 275-300.

Greenwood, M.A. (2005) Open-Domain Question Answering. PhD, University of Sheffield. [online]. Available: http://www.dcs.shef.ac.uk/intranet/research/phdtheses/Greenwood2006.pdf [1/3/2006].

Greenwood, M.A. (2006) Question Answering at TREC. Natural Language Processing Group, Department of Computer Science, University of Sheffield. [online]. Available: http://nlp.shef.ac.uk/talks/greenwood_20060117.ppt#38 [15/8/2006].

Joho, H. (1999) Automatic Detection of Descriptive Phrases for Question Answering System: A Simple Pattern Matching Approach. MSc, University of Sheffield. [online]. Available: http://dis.shef.ac.uk/mark/cv/publications/dissertations/Joho1999.pdf [1/5/2006].

Katz, B., Lin, J. and Felshin, S. (2001) Gathering Knowledge for a Question Answering System from Heterogeneous Information Sources. In Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management. France: Toulouse. [online]. Available: http://groups.csail.mit.edu/infolab/publications/Katz-etal-ACL01.pdf [1/5/2006].

Korfhage, R. R. (1997) Information Storage and Retrieval. Canada: John Wiley & Sons, Inc.
Levene, M. (2006) An Introduction to Search Engines and Web Navigation. Harlow: Addison-Wesley.

Liddy, E. D. (2001) When You Want an Answer, Why Settle for A List? Center for Natural Language Processing, School of Information Studies, Syracuse University. [online]. Available: http://www.cnlp.org/presentations/slides/When_You_Want_Answer.pdf [2/8/2006].

Monz, C. (2003) From Document Retrieval to Question Answering. PhD, University of Amsterdam.

Pazzani, M. J. (1983) Interactive Script Instantiation. In Proceedings of the National Conference on Artificial Intelligence. Washington DC: Morgan Kaufmann. pp. 320-326.

Salton, G. (1968) Automatic Information Organization and Retrieval. New York; Maidenhead: McGraw-Hill.

Salton, G. and McGill, M. J. (1983) Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Sanka, A. (2005) Passage Retrieval for Question Answering. MSc, University of Sheffield. [online]. Available: http://www.dcs.shef.ac.uk/intranet/teaching/projects/archive/msc2005/pdf/m4as1.pdf [31/7/2006].

Sargaison, M. (2003) Named Entity Information Extraction and Semantic Indexing Techniques for Improving Automated Question Answering Systems. MSc, University of Sheffield. [online]. Available: http://www.dcs.shef.ac.uk/intranet/teaching/projects/archive/msc2003/pdf/m2ms.pdf [31/7/2006].

Voorhees, E. M. (2004) Overview of the TREC 2004 Question Answering Track. In Proceedings of the 13th Text REtrieval Conference (TREC 2004). Maryland: Gaithersburg. [online]. Available: http://trec.nist.gov/pubs/trec13/papers/QA.OVERVIEW.pdf [31/7/2006].

Voorhees, E. M. (2005) Overview of TREC 2005. In Proceedings of the 14th Text REtrieval Conference (TREC 2005). Maryland: Gaithersburg. [online]. Available: http://trec.nist.gov/pubs/trec14/papers/OVERVIEW14.pdf [31/7/2006].

Woods, W. (1973) Progress in Natural Language Understanding - An Application to Lunar Geology. In AFIPS Conference Proceedings, volume 42, pages 441-450.
WWW1: Fun Trivia.com (2005) [online]. Available: http://www.funtrivia.com/ [25/2/2006].

WWW2: Meng, Z. (No Date) Question Answering. [online]. Available: http://www.csee.umbc.edu/~nicholas/691B/Sp05presentations/Question%20Answering.ppt#5 [15/2/2006].

WWW3: Text REtrieval Conference (TREC) QA Track (2005) [online]. Available: http://trec.nist.gov/data/qa.html [16/2/2006].

WWW4: Text REtrieval Conference (TREC) Overview (2004) [online]. Available: http://trec.nist.gov/overview.html [16/2/2006].

WWW5: TREC-8 QA Data (2005) [online]. Available: http://trec.nist.gov/data/qa/T8_QAdata/topics.qa_questions.txt [16/2/2006].

WWW6: TREC 2004 QA Data (2005) [online]. Available: http://trec.nist.gov/data/qa/2004_qadata/QA2004_testset.xml [16/2/2006].

WWW7: Robust Accurate Statistical Parsing (RASP) (No Date). [online]. Available: http://www.informatics.susx.ac.uk/research/nlp/rasp/ [20/7/2006].

WWW8: UCREL CLAWS2 Tagset (No Date) [online]. Available: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws2tags.html [20/7/2006].
Appendix A
The 13 conceptual question categories used in Wendy Lehnert's QUALM, with examples:

1. Causal Antecedent - Why did John go to New York? What resulted in John's leaving? How did the glass break?
2. Goal Orientation - For what purposes did John take the book? Why did Mary drop the book? Mary left for what reason?
3. Enablement - How was John able to eat? What did John need to do in order to leave?
4. Causal Consequent - What happened when John left? What if I don't leave? What did John do after Mary left?
5. Verification - Did John leave? Did John do anything to keep Mary from leaving? Does John think that Mary left?
6. Disjunctive - Was John or Mary here? Is John coming or going?
7. Instrumental/Procedural - How did John go to New York? What did John use to eat? How do I get to your house?
8. Concept Completion - What did John eat? Who gave Mary the book? When did John leave Paris?
9. Expectational - Why didn't John go to New York? Why isn't John eating?
10. Judgmental - What should John do to keep Mary from leaving? What should John do now?
11. Quantification - How many people are there? How ill was John? How many dogs does John have?
12. Feature Specification - What color are John's eyes? What breed of dog is Rover? How much does that rug cost?
13. Request - Would you pass the salt? Can you get me my coat? Will you take out the garbage?
Appendix B
Example for Entertainment Question:
qID= 2
Who was the first U.S. President to appear on television
a. Franklin Roosevelt
b. Theodore Roosevelt
c. Calvin Coolidge
d. Herbert Hoover
answer: a

Example for History Question:
qID= 5
Who was the first English explorer to land in Australia
a. James Cook
b. Matthew Flinders
c. Dirk Hartog
d. William Dampier
answer: d

Example for Geography Question:
qID= 1
What is the capital of Uruguay
a. Assuncion
b. Santiago
c. La Paz
d. Montevideo
answer: d

Example for Literature Question:
qID= 1
Who wrote 'The Canterbury Tales'
a. William Shakespeare
b. William Faulkner
c. Christopher Marlowe
d. Geoffrey Chaucer
answer: d

Example for Science or Nature Question:
qID= 1
Which of these programming languages was invented at Bell Labs
a. COBOL
b. C
c. FORTRAN
d. BASIC
answer: b

Example for Sport Question:
qID= 1
Which of the following newspapers sponsored a major sports stadium/arena
a. Chicago Tribune
b. Arizona Republic
c. Boston Globe
d. St Petersburg Times
answer: d
Appendix C
These are the questions for which we failed to get the right answers in the first manual experiment. In the second manual experiment we therefore modified them, using important keywords in the question, and combined them with each choice as follows:

Question 4: In 1594, Shakespeare became an actor and playwright in what company?
e. The King's Men
f. Globe Theatre
g. The Stratford Company
h. Lord Chamberlain's Men

The previous question had a wrong answer in the first experiment; therefore the query form was restructured and combined with each choice as follows:
"In 1594" Shakespeare became an actor and playwright in "(one of the choices)"
This query was combined with the first choice and sent to Google as follows:
"In 1594" Shakespeare became an actor and playwright in "The King's Men"
This was repeated with each choice. In the previous example, keywords in the question were chosen so as to improve the accuracy of the answer. The year was put in quotation marks, as follows: "In 1594". It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer, as follows:

Question 5: What astronomer published his theory of a sun-centered universe in 1543?
e. Johannes Kepler
f. Galileo Galilei
g. Isaac Newton
h. Nicholas Copernicus

This question was restructured by replacing "What astronomer" with a quoted choice; the year was also put in quotation marks, "in 1543", as follows:
"(one of the choices)" published his theory of a sun-centered universe "in 1543"
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he defeated the boyars (influential families) in 1547. Who was he?
e. Nicholas II
f. Ivan IV (the Terrible)
g. Peter the Great
h. Alexander I

In question 6, "This Russian ruler" was replaced with a quoted choice and some important words, such as "first", "Czar" and "defeated the boyars", were used. Moreover, the year was put in quotation marks; hence the query is as follows:
"(one of the choices)" was first Czar defeated the boyars "in 1547"

Question 7: Sir Thomas More, Chancellor of England from 1529-1532, was beheaded by what King for refusing to acknowledge the King as the head of the newly formed Church of England?
a. James I
b. Henry VI
c. Henry VIII
d. Edward VI

The previous question was modified by using some keywords, such as "Sir Thomas More" and "was beheaded by"; also, "what King" was replaced with a quoted choice as follows:
Sir Thomas More was beheaded by "(one of the choices)"

Question 9: Which dynasty was in power throughout the 1500's in China?
a. Ming
b. Manchu
c. Han
d. Yuan

In this question, "Which" was replaced with a quoted choice followed by "dynasty", as follows:
"(one of the choices) dynasty" was in power throughout the 1500's in China
Question 11: Ponce de Leon, 'discoverer' of Florida (1513), was also Governor of what Carribean island?
a. Cuba
b. Puerto Rico
c. Virgin Islands
d. Bahamas

In question 11, some keywords were chosen, such as "Ponce de Leon", "was" and "Governor of", and "what Carribean island" was replaced with a choice as follows:
Ponce de Leon was Governor of (one of the choices)

Question 14: Who did Queen Elizabeth feel threatened by and had executed in 1587?
a. Jane Seymour
b. Margaret Tudor
c. Mary, Queen of Scots
d. James I

In this question, "Who did" was deleted and other words were quoted, such as "Queen Elizabeth". Also, "threatened by" was combined with a choice and quoted, as follows:
"Queen Elizabeth" feel "threatened by (one of the choices)" and had executed in 1587

Question 15: What 16th century Cartographer and Mathematician developed a projection map representing the world in two dimensions using latitude and longitude?
a. Ferdinand Magellan
b. Gerardus Mercator
c. Galileo Galilei
d. Ptolemy

The previous question was modified by replacing "What 16th century Cartographer and Mathematician" with the choice (or with the surname of each choice instead of the person's whole name), and by quoting other important words such as "projection map", "two dimensions" and "latitude and longitude", as follows:
(one of the choices) developed "projection map" representing the world in "two dimensions" using "latitude and longitude"
Question 16: Which group controlled North Africa throughout most of the 16th century?
a. French
b. Spanish
c. Egyptians
d. Turks

This was the only question for which the right answer could not be obtained. The forms were modified several times by replacing "Which" and "Which group" with one of the choices. In addition, some search operators such as OR and AND were used, and a noun was added for each choice since the choices were adjectives. Also, each form was tried with and without quotation marks, as follows:
• (one of the choices) AND (the noun of the choices) group controlled North Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) group controlled North Africa throughout most of the 16th century
• "(one of the choices) AND (the noun of the choices)" group controlled North Africa throughout most of the 16th century
• "(one of the choices) OR (the noun of the choices)" group controlled North Africa throughout most of the 16th century
• (one of the choices) AND (the noun of the choices) controlled North Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) controlled North Africa throughout most of the 16th century
• "(one of the choices) AND (the noun of the choices)" controlled North Africa throughout most of the 16th century
• "(one of the choices) OR (the noun of the choices)" controlled North Africa throughout most of the 16th century

Question 17: Martin Luther started the Protestant Reformation in 1517 by criticizing the Roman Catholic Church. Where did he nail his 95 Theses?
a. Sistine Chapel
b. Wittenberg Cathedral
c. St. Paul's Cathedral
d. Louvre

The previous question was modified as follows:
Martin Luther nail his "95 Theses in the (one of the choices)"

Question 20: What association, formed in the late 16th century, comprised of five Native American tribes?
a. Hopi League
b. Native American Council
c. Sioux Confederation
d. Iroquois League

This question was modified by replacing "What association, formed in the late 16th century" with a quoted choice as follows:
"(one of the choices)" comprised of five Native American tribes.
Appendix D
Appendix D QueryPost program code: import java.io.*; import java.util.*; import javax.swing.*; public class QueryPost{ public static void main(String rgs[]) throws Exception{ String Fname= JOptionPane.showInputDialog("Enter the file name: ");
BufferedReader inFile= new BufferedReader(new FileReader(Fname+".txt")); BufferedWriter outFile = new BufferedWriter(new FileWriter(Fname+"Without.txt")); BufferedWriter outFile2 = new BufferedWriter(new FileWriter("result"+Fname+".txt")); String line=null,qID=null,ans=null,tans=null,qtans=null; String[] choice= new String[4]; String[] query= new String[5]; double[] res= new double[5]; double answer=0; Google g = new Google(); outFile.write("qID********hits********a********b********c********d******** CorrectAnswer********answer"); outFile.newLine(); line=inFile.readLine(); while(line!=null) { while(!line.startsWith("")){ if(line.startsWith("qID=")){ StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("qID=")){
Appendix D qID=st.nextToken(); } query[0]=inFile.readLine(); } if (line.startsWith("a.")){ choice[0]=line.substring(3); } if(line.startsWith("b.")){ choice[1]=line.substring(3); } if(line.startsWith("c.")){ choice[2]=line.substring(3); } if(line.startsWith("d.")){ choice[3]=line.substring(3); } if (line.startsWith("answer:")){ StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("answer:")){ ans=st.nextToken(); } } line=inFile.readLine(); } line=inFile.readLine(); for(int i=0;i<4;i++) { query[i+1]=query[0]+" "+choice[i]; } for(int i=0;i
for(int i=0;i
Appendix D System.out.println(" The answer for this question is choice d "); tans="d"; else { System.out.println(" The answer for this question is choice c "); tans="c"; } } else if (res[2]
Appendix D } else if (res[1]
Appendix E
Appendix E An output sample of QueryPost program for sport questions: qID********hits********a********b********c********d********CorrectAnswe r********answer 1********282.0********23.0********15.0********44.0********64.0********d* *******d 2********8.0********-1.0********-1.0********-1.0********1.0********c********a 3********54.0********41.0********35.0********52.0********28.0********c** ******c 4********33.0********30.0********32.0********31.0********22.0********b** ******b 5********16100.0********10900.0********27600.0********44100.0********34 500.0********c********c 6********3.0********-1.0********-1.0********-1.0********1.0********d********a 7********366.0********291.0********360.0********307.0********127.0****** **b********b 8********-1.0********-1.0********-1.0********-1.0********1.0********b********b 9********12200.0********886.0********410.0********845.0********10900.0** ******c********d 10********93.0********89.0********47.0********81.0********34.0********a* *******a 11********85100.0********26800.0********23800.0********21600.0********2 4400.0********d********a 12********114000.0********96700.0********27800.0********45600.0******** 50200.0********a********a 13********28700.0********862.0********628.0********529.0********1.0********a********a 14********15200.0********596.0********548.0********809.0********654.0*** *****b********c 15********5.0********-1.0********-1.0********-1.0********1.0********c********a 16********24400.0********785.0********770.0********927.0********910.0*** *****c********c 17********540000.0********70600.0********67300.0********72500.0******** 62900.0********d********c
Appendix E 18********628000.0********11.0********110000.0********137000.0********1 83000.0********d********d 19********20000.0********943.0********15200.0********18500.0********132 00.0********c********c 20********15800.0********107.0********331.0********891.0********469.0*** *****d********c
Appendix F
Appendix F QueryPost2 program code: import java.io.*; import java.util.*; import javax.swing.*; public class QueryPost2{ public static void main(String rgs[]) throws Exception{ String Fname= JOptionPane.showInputDialog("Enter the file name: "); BufferedReader inFile= new BufferedReader(new FileReader(Fname+".txt")); BufferedWriter outFile = new BufferedWriter(new FileWriter(Fname+"Without.txt")); BufferedWriter outFile1 = new BufferedWriter(new FileWriter(Fname+"With.txt")); BufferedWriter outFile2 = new BufferedWriter(new FileWriter("result"+Fname+".txt")); String line=null,qID=null,ans=null,tans=null,qtans=null; String[] choice= new String[4]; String[] query= new String[9]; double[] res= new double[9]; double answer=0,qanswer=0; Google g = new Google(); outFile.write("qID********hits********a********b********c********d******** CorrectAnswer********answer"); outFile1.write("qID********hits********a********b********c********d******* *CorrectAnswer********answer"); outFile.newLine(); outFile1.newLine(); line=inFile.readLine(); while(line!=null) { while(!line.startsWith("")){ if(line.startsWith("qID=")){
Appendix F StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("qID=")){ qID=st.nextToken(); } query[0]=inFile.readLine(); } if (line.startsWith("a.")){ choice[0]=line.substring(3); } if(line.startsWith("b.")){ choice[1]=line.substring(3); } if(line.startsWith("c.")){ choice[2]=line.substring(3); } if(line.startsWith("d.")){ choice[3]=line.substring(3); } if (line.startsWith("answer:")){ StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("answer:")){ ans=st.nextToken(); } } line=inFile.readLine(); } line=inFile.readLine(); for(int i=0;i<4;i++) { query[i+1]=query[0]+" "+choice[i]; } for(int i=0;i<4;i++) { query[i+5]=query[0]+" "+"\""+choice[i]+"\""; } for(int i=0;i
Appendix F System.out.println("Google Search Results for: "+query[i]+" "+ res[i]+ " documents"); outFile2.write("Google Search Results for: "+query[i]+" "+ res[i]+ " documents"); outFile2.newLine(); outFile2.write("======================"); outFile2.newLine(); } if (res[0] == -1) { System.out.println(" No hit for this question, so there is not an answer"); outFile2.write(" No hit for this question, so there is not an answer"); outFile2.newLine(); } else { answer=res[1]; System.out.println(" answer = first result"+res[1]); if (res[1]
Appendix F System.out.println(" The answer for this question is choice d "); tans="d"; } else { System.out.println(" The answer for this question is choice b "); tans="b"; } } else if (res[1]
Appendix F if (res[5]
Appendix F System.out.println(" The answer for this question is choice d "); qtans="d"; } else { System.out.println(" The answer for this question is choice c "); qtans="c"; } } else if (res[5]
Appendix G
Appendix G QueryFilter program code: import java.io.*; import java.util.*; import javax.swing.*; public class QueryFilter{ public static void main(String rgs[]) throws Exception{ String Fname= JOptionPane.showInputDialog("Enter the file name: "); BufferedReader inFile= new BufferedReader(new FileReader(Fname+".txt")); BufferedWriter outFile1 = new BufferedWriter(new FileWriter(Fname+"WhoQuery.txt")); BufferedWriter outFile2 = new BufferedWriter(new FileWriter(Fname+"WhereQuery.txt")); BufferedWriter outFile3 = new BufferedWriter(new FileWriter(Fname+"WhichQuery.txt")); BufferedWriter outFile4 = new BufferedWriter(new FileWriter(Fname+"ThisWhatQuery.txt")); BufferedWriter outFile5 = new BufferedWriter(new FileWriter(Fname+"OtherQuery.txt")); String line=null,qID=null; String[] query= new String[1]; line=inFile.readLine(); System.out.println(line); while(line!=null) { System.out.println(line); if(line.startsWith("qID=")){ qID=line; System.out.println(qID); line=inFile.readLine(); System.out.println(line); StringTokenizer st=new StringTokenizer(line); if(line.startsWith("Who ")){
Appendix G System.out.println("Who if "); query[0]=line.substring(4); outFile1.write(qID); outFile1.newLine(); outFile1.write(query[0]); outFile1.newLine(); for(int i=0;i<6;i++) { outFile1.write(inFile.readLine()); outFile1.newLine(); } } else if(line.startsWith("Where ")){ System.out.println("Where if "); query[0]=line.substring(6); outFile2.write(qID); outFile2.newLine(); outFile2.write(query[0]); outFile2.newLine(); for(int i=0;i<6;i++) { outFile2.write(inFile.readLine()); outFile2.newLine(); } } else if(line.startsWith("Which ")){ System.out.println("Which if "); query[0]=line.substring(6); outFile3.write(qID); outFile3.newLine(); outFile3.write(query[0]); outFile3.newLine(); for(int i=0;i<6;i++) { outFile3.write(inFile.readLine()); outFile3.newLine(); } } else if(line.startsWith("This ")){ System.out.println("This if ");
Appendix G query[0]=line.substring(5); outFile4.write(qID); outFile4.newLine(); outFile4.write(query[0]); outFile4.newLine(); for(int i=0;i<6;i++) { outFile4.write(inFile.readLine()); outFile4.newLine(); } } else if(line.startsWith("What ")){ System.out.println("What if "); query[0]=line.substring(5); outFile4.write(qID); outFile4.newLine(); outFile4.write(query[0]); outFile4.newLine(); for(int i=0;i<6;i++) { outFile4.write(inFile.readLine()); outFile4.newLine(); } } else { System.out.println("Other "); query[0]=line; outFile5.write(qID); outFile5.newLine(); outFile5.write(query[0]); outFile5.newLine(); for(int i=0;i<6;i++) { outFile5.write(inFile.readLine()); outFile5.newLine(); } } } line=inFile.readLine();
Appendix G } outFile1.close(); outFile2.close(); outFile3.close(); outFile4.close(); outFile5.close(); } }
Appendix H
Appendix H QueryPost3 program code: import java.io.*; import java.util.*; import javax.swing.*; public class QueryPost3{ public static void main(String rgs[]) throws Exception{ String Fname= JOptionPane.showInputDialog("Enter the file name: "); BufferedReader inFile= new BufferedReader(new FileReader(Fname+".txt")); BufferedWriter outFile = new BufferedWriter(new FileWriter(Fname+"Without.txt")); BufferedWriter outFile1 = new BufferedWriter(new FileWriter(Fname+"With.txt")); BufferedWriter outFile2 = new BufferedWriter(new FileWriter("result"+Fname+".txt")); String line=null,qID=null,ans=null,tans=null,qtans=null; String[] choice= new String[4]; String[] query= new String[9]; double[] res= new double[9]; double answer=0,qanswer=0; Google g = new Google(); outFile.write("qID********hits********a********b********c********d******** CorrectAnswer********answer"); outFile1.write("qID********hits********a********b********c********d******* *CorrectAnswer********answer"); outFile.newLine(); outFile1.newLine(); line=inFile.readLine(); while(line!=null) { while(!line.startsWith("")){ if(line.startsWith("qID=")){
Appendix H StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("qID=")){ qID=st.nextToken(); } query[0]=inFile.readLine(); } if (line.startsWith("a.")){ choice[0]=line.substring(3); } if(line.startsWith("b.")){ choice[1]=line.substring(3); } if(line.startsWith("c.")){ choice[2]=line.substring(3); } if(line.startsWith("d.")){ choice[3]=line.substring(3); } if (line.startsWith("answer:")){ StringTokenizer st=new StringTokenizer(line); if(st.nextToken().equals("answer:")){ ans=st.nextToken(); } } line=inFile.readLine(); } line=inFile.readLine(); for(int i=0;i<4;i++) { query[i+1]=choice[i]+" "+query[0]; } for(int i=0;i<4;i++) { query[i+5]="\""+choice[i]+"\""+" "+query[0]; } for(int i=0;i
Appendix H System.out.println("Google Search Results for: "+query[i]+" "+ res[i]+ " documents"); outFile2.write("Google Search Results for: "+query[i]+" "+ res[i]+ " documents"); outFile2.newLine(); outFile2.write("======================"); outFile2.newLine(); } if (res[0] == -1) { System.out.println(" No hit for this question, so there is not an answer"); outFile2.write(" No hit for this question, so there is not an answer"); outFile2.newLine(); } else { answer=res[1]; System.out.println(" answer = first result"+res[1]); if (res[1]
Appendix H System.out.println(" The answer for this question is choice d "); tans="d"; } else { System.out.println(" The answer for this question is choice b "); tans="b"; } } else if (res[1]
Appendix H if (res[5]
Appendix H System.out.println(" The answer for this question is choice d "); qtans="d"; } else { System.out.println(" The answer for this question is choice c "); qtans="c"; } } else if (res[5]
Appendix I
Appendix I An output sample of the RASP system, which is a text file and all of its words are tagged with special PoS: qID=_NN1 3_MC NBC_NP1 broadcast_VVD the_AT first_MD sportscast_NN1 of_IO this_DD1 game_NN1 in_II 1939_MC a._NNU Wrestling_VVG b._&FW Boxing_NN1 c._RR Football_NN1 d._NNU Baseball_NN1 answer_NN1 :_: d_NN2 _( qID=_NN1 14_MC What_DDQ genre_NN1 is_VBZ the_AT 1993_MC movie_NN1 'Blue_NP1 '_$ a._NN1 Comedy_NP1
Appendix I b._&FW Horror_NN1 c._RR Documentary_JJ d._NNU Western_JJ answer_NN1 :_: c_ZZ1 _) qID=_NN1 17_MC What_DDQ is_VBZ the_AT first_MD name_NN1 of_IO the_AT lead_NN1 singer_NN1 in_II the_AT band_NN1 'Little_NP1 Crunchy_NP1 Blue_NP1 Things_NP1 '_$ a._NN1 Hunter_NP1 b._&FW Eric_NP1 c._RR Noah_NP1 d._NNU Brian_NP1 answer_NN1 :_: c_ZZ1 _)
Appendix J
These are the noun tags according to the 155 CLAWS-2 part-of-speech (PoS) tags, taken from (WWW8):

Tag      Tag type
ND1      singular noun of direction (north, southeast)
NN       common noun, neutral for number (sheep, cod)
NN1      singular common noun (book, girl)
NN1$     genitive singular common noun (domini)
NN2      plural common noun (books, girls)
NNJ      organization noun, neutral for number (department, council, committee)
NNJ1     singular organization noun (Assembly, commonwealth)
NNJ2     plural organization noun (governments, committees)
NNL      locative noun, neutral for number (Is.)
NNL1     singular locative noun (street, Bay)
NNL2     plural locative noun (islands, roads)
NNO      numeral noun, neutral for number (dozen, thousand)
NNO1     singular numeral noun (no known examples)
NNO2     plural numeral noun (hundreds, thousands)
NNS      noun of style, neutral for number (no known examples)
NNS1     singular noun of style (president, rabbi)
NNS2     plural noun of style (presidents, viscounts)
NNSA1    following noun of style or title, abbreviatory (M.A.)
NNSA2    following plural noun of style or title, abbreviatory
NNSB     preceding noun of style or title, abbr. (Rt. Hon.)
NNSB1    preceding sing. noun of style or title, abbr. (Prof.)
NNSB2    preceding plur. noun of style or title, abbr. (Messrs.)
NNT      temporal noun, neutral for number (no known examples)
NNT1     singular temporal noun (day, week, year)
NNT2     plural temporal noun (days, weeks, years)
NNU      unit of measurement, neutral for number (in., cc.)
NNU1     singular unit of measurement (inch, centimetre)
NNU2     plural unit of measurement (inches, centimetres)
NP       proper noun, neutral for number (Indies, Andes)
NP1      singular proper noun (London, Jane, Frederick)
NP2      plural proper noun (Browns, Reagans, Koreas)
NPD1     singular weekday noun (Sunday)
NPD2     plural weekday noun (Sundays)
NPM1     singular month noun (October)
NPM2     plural month noun (Octobers)
Appendix K
Appendix K NounRemove program code: import java.io.*; import java.util.*; import javax.swing.*; public class NounRemove{ public static void main(String rgs[]) throws Exception{ String Fname= JOptionPane.showInputDialog("Enter the file name: "); BufferedReader inFile= new BufferedReader(new FileReader(Fname+".txt")); BufferedWriter outFile = new BufferedWriter(new FileWriter(Fname+"NRemoved.txt")); String line=null,qID=null; line=inFile.readLine(); while(line!=null) { System.out.println(line); if(line.startsWith("qID=")){ qID=inFile.readLine(); outFile.write("qID="+qID); outFile.newLine(); System.out.println(qID); line=inFile.readLine(); System.out.println(line); if((line.startsWith("This"))||(line.startsWith("What"))){ line=inFile.readLine(); while (line.endsWith("_NN1")){ line=inFile.readLine(); System.out.println(line); } } while(!line.startsWith("")){
Appendix K outFile.write(line); outFile.newLine(); System.out.println(line); line=inFile.readLine(); } outFile.write(line); outFile.newLine(); } line=inFile.readLine(); } outFile.close(); } }