ANOMALY DETECTION FROM AVIATION SAFETY REPORTS by Suraj Raghuraman

APPROVED BY SUPERVISORY COMMITTEE:

___________________________________________ Dr. Vincent Ng, Chair

___________________________________________ Dr. Yang Liu

___________________________________________ Dr. Latifur Khan

© Copyright 2008 Suraj Raghuraman All Rights Reserved

To my late Grandfather & Family To my advisor, Dr. Vincent Ng To all my friends

ANOMALY DETECTION FROM AVIATION SAFETY REPORTS by SURAJ RAGHURAMAN, B.E.

THESIS Presented to the Faculty of The University of Texas at Dallas in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS December, 2008

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Vincent Ng, for putting up with me. I would have given up long before if it had not been for his support and motivation. He somehow knew what would motivate me and did his best to keep me going. I would like to thank my family for motivating me enough to study abroad; I would still be working in India if it weren't for them. Thanks for supporting me when I needed your support the most. I would like to thank my friends for keeping me well fed and happy during the bad times, and for motivating me enough to finish my thesis. If it weren't for you, I would never have thought of finishing my Master's in a year.

September, 2008


ANOMALY DETECTION FROM AVIATION SAFETY REPORTS

Publication No. ___________________

Suraj Raghuraman, M.S.
The University of Texas at Dallas, 2008

Supervising Professor: Dr. Vincent Ng

ABSTRACT

Anomaly detection from Aviation Safety Reporting System (ASRS) reports is the task of identifying the multiple types of anomalies that resulted in an aviation incident, based on the narrative portion of the report, which is written with a lot of domain specific terminology. This thesis proposes an effective semi-supervised approach to solving the problem without using any domain related information. The approach relies on methods similar to singular value decomposition and latent semantics for feature generation, and a Support Vector Machine with transduction is used to classify the narratives. Experimental results show that the method developed is marginally better than the existing methods for high frequency anomalies and substantially better for low frequency anomalies on the ASRS dataset.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................ v
ABSTRACT ................................................ vi
LIST OF TABLES .......................................... ix
LIST OF FIGURES ......................................... x
CHAPTER 1 INTRODUCTION .................................. 1
CHAPTER 2 PROBLEM DESCRIPTION ........................... 3
CHAPTER 3 RELATED WORK .................................. 5
  3.1 Supervised Approach ................................ 5
    3.1.1 Dynamic Template Rule .......................... 5
    3.1.2 Naïve Bayes Classifier ......................... 5
    3.1.3 Support Vector Machines ........................ 6
  3.2 Semi-supervised Approach ........................... 7
    3.2.1 Naïve Bayes EM ................................. 7
CHAPTER 4 OUR SEMI-SUPERVISED APPROACH .................. 9
  4.1 Introduction ....................................... 9
  4.2 Data Analysis ...................................... 10
  4.3 Feature Generation ................................. 12
    4.3.1 Content Extraction ............................. 12
    4.3.2 Tokenization ................................... 12
    4.3.3 Hash Generation ................................ 13
  4.4 Word Selection ..................................... 14
    4.4.1 Plain Information Gain ......................... 15
    4.4.2 Simple Subtraction ............................. 15
    4.4.3 Combined Metric ................................ 15
  4.5 Feature Reduction .................................. 16
    4.5.1 Term By Document Vector ........................ 16
    4.5.2 Inner Product .................................. 17
    4.5.3 Singular Value Decomposition ................... 19
    4.5.4 Gram Schmidt Process ........................... 19
    4.5.5 Applying Gram Schmidt .......................... 20
  4.6 Classification ..................................... 21
    4.6.1 Learning ....................................... 22
    4.6.2 Classifying .................................... 22
    4.6.3 Interpreting Output ............................ 23
CHAPTER 5 EVALUATION .................................... 24
  5.1 Dataset ............................................ 24
  5.2 Experiment Setup ................................... 25
  5.3 Baseline Experiments ............................... 25
    5.3.1 Plain SVM ...................................... 25
    5.3.2 Naïve Bayes EM ................................. 27
    5.3.3 Gram Schmidt Kernel ............................ 29
  5.4 Our Semi Supervised Approach ....................... 31
    5.4.1 Stage 1: Word Selection ........................ 31
    5.4.2 Stage 2: Further Dimensionality Reduction ...... 33
  5.5 Result And Error Analysis .......................... 35
CHAPTER 6 CONCLUSION .................................... 37
REFERENCES .............................................. 39
Vita


LIST OF TABLES

Number                                                      Page
2.1 ANOMALY DETAILS ......................................... 4
4.1 TERM BY DOCUMENT VECTOR ................................ 17
4.2 TERM BY DOCUMENT VECTORS FOR S1, S2 AND S3 ............. 18
5.1 DATASET COMPOSITIONS ................................... 25
5.2 RESULTS OF PLAIN SVM ................................... 26
5.3 RESULTS OF NAÏVE BAYES EM .............................. 28
5.4 RESULTS OF GSK ......................................... 30
5.5 RESULTS OF STAGE 1 ..................................... 33
5.6 RESULTS OF STAGE 2 ..................................... 34
5.7 COMPARING THE F-SCORES ................................. 36


LIST OF FIGURES

Number                                                      Page
2.1 AN EXAMPLE NARRATIVE .................................... 3
4.1 NARRATIVE FOR ANOMALY, INCURSION – RUNWAY .............. 10
4.2 NARRATIVE FOR ANOMALY, INFLIGHT ENCOUNTER – WEATHER .... 11
4.3 TOKENIZED NARRATIVE FOR ANOMALY, INCURSION – RUNWAY .... 13
4.4 HASHED VERSION OF TOKENIZED NARRATIVE .................. 14
4.5 REDUCED HASHED NARRATIVE ............................... 16
4.6 SAMPLE TRAIN FILE CONTENT .............................. 22
4.7 SAMPLE TEST FILE CONTENT ............................... 23
5.1 RESULTS OF PLAIN SVM ................................... 27
5.2 RESULTS OF NAÏVE BAYES EM .............................. 29
5.3 RESULTS OF GSK ......................................... 31
5.4 RESULTS OF STAGE 1 ..................................... 33
5.5 RESULTS OF STAGE 2 ..................................... 35
5.6 COMPARING ALL RESULTS .................................. 36


CHAPTER 1 INTRODUCTION

Aviation safety reports are a rich source of information on flight incidents. Every time an aviation incident occurs anywhere around the world, a report is written that contains the complete details of the incident. Over the years these reports from around the world were submitted to the ASRS archive at the National Aeronautics and Space Administration (NASA), and over the decades the database has grown to contain over 150 thousand incident narratives. The task of anomaly detection is to determine the cause of the incident by looking only at the narrative portion of the report. There are about 62 anomaly classes in all. The documents come from varying backdrops and contain many grammatical mistakes, incorrect spellings, and conflicting abbreviations; they are very hard for even an average person to understand, which makes the task difficult. The goal of this thesis is to develop a semi-supervised approach for anomaly detection without relying on any knowledge of aviation domain terminology or procedures. We start with the existing approaches that were applied to the dataset and approaches that were used for similar classification tasks, since it is wise to build upon the approaches on which people are already working. We therefore begin by enhancing the word selection methods and obtaining better accuracies by using semi-supervised learning algorithms like support vector machines with transduction.


After that, the focus shifts to using the entire dataset, and not just the labeled instances, to learn the features. For this we propose a new approach that uses similarity between documents to generate features. We use latent semantics and singular value decomposition to obtain the similarity between documents. While computing the similarity, greater emphasis is given to the words that help identify the presence of the anomalies. The key idea that this thesis introduces is the use of a set of documents to generate the features for a single document. Instead of just looking at the similarity between two documents and classifying based on that alone, an instance selection algorithm is used to pick out points of reference; then, based on the similarity with these varying reference documents, patterns are extracted using an SVM and used to classify documents into anomaly classes. In order to take greater advantage of the instances we have, an SVM with transduction is used instead of a fully supervised SVM. Finally, we show that the word selection based methods that we developed perform better than the existing methods. The semantics based method does not perform as well as the word selection based methods, but performs far better than the approximated latent semantics based algorithms on this dataset.

CHAPTER 2 PROBLEM DEFINITION

Anomaly detection in Aviation Safety Reporting System (ASRS) reports is the task of identifying the types of anomalies that resulted in an aviation incident, based on the narrative portion of the report. Incidents can occur for more than one reason, and hence more than one anomaly can be present in a single narrative. Aviation officials write an ASRS report whenever an incident occurs. These reports are written all around the world and as a result contain varying styles of writing. The content of a report is highly aviation domain specific and contains many misspelled words, acronyms, and grammatical mistakes. To give a clearer picture of what a report looks like, an example is given below.

ACR Y CLIMBING TO FL210 WAS STOPPED AT 160 FOR TFC AT 170; WHICH PASSED. ACR Y WAS THEN KEPT AT 160. ACR X WAS DESCENDED TO 170 UNTIL CLEAR OF TFC; THEN DESCENDED TO 150. ACR X WAS NOT OUT OF 170 WHEN CLRNC ISSUED TO STOP AT 170. NO ANSWER. CLRNC GIVEN AGAIN. NO ANSWER. TURNED ACR Y FROM 320 DEG TO 250 DEG HDG. CLRNC TO ACR X AGAIN WITH TURN TO 060 DEG; NO ANSWER. FINALLY ACR X ANSWERED AND TURNED TO 060 DEG WITH CLIMB BACK TOS 170. ACFT PASSED 4.3 MI AND 200' APART. ACR X DID NOT RESPOND TO CLRNC. POSSIBLE LACK OF COCKPIT DISCIPLINE.

Figure 2.1. An example narrative


As you can clearly see in Figure 2.1, most of the text is not very readable to a non-aviation expert. The other complication is the fact that most acronyms are 3 characters in length. For example, ACR can stand for Aircraft as well as Air Control Radar, and it may also refer to an airport in Colombia. In addition, ACFT is the standard way of referring to an aircraft. There are over 62 classes of anomalies that a narrative can belong to. For the sake of simplicity, and to reduce the amount of processing the data needs to go through, we will consider only 8 anomalies. In order not to lose out on the variations that each anomaly brings forth in terms of frequency of occurrence and the situations in which they occur, the anomalies we choose for classification are listed in the table below.

Table 2.1. Anomaly Details

Class  Anomaly                                      Description
1      Aircraft Equipment Problem – Critical        Some equipment on the aircraft has malfunctioned.
2      Airspace Violation – Entry                   Aircraft was flown into an unauthorized location.
3      Incursion – Runway                           Incorrect presence of a vehicle, person or aircraft on the runway.
4      Inflight Encounter – Turbulence              Heavy turbulence was encountered during the flight.
5      Inflight Encounter – Weather                 Severe weather was encountered while flying.
6      Inflight Encounter – VFR in IMC              Visual Flight Rules (VFR) into Instrument Meteorological Conditions (IMC), i.e., unqualified flight into adverse weather.
7      Maintenance Problem – Improper maintenance   Some part of the aircraft was not properly maintained.
8      Non Adherence – Clearance                    The Air Traffic Controller (ATC) did not clear the pilot to perform the operation.

CHAPTER 3 RELATED WORK

3.1 Supervised Approaches

Supervised approaches use labeled documents to learn and then label the unlabeled documents. The next sections cover some of the supervised approaches used on the ASRS dataset.

3.1.1 Dynamic Template Rule System
The basic approach tried out by [1] was based on regular expressions. This approach is heavily influenced by the domain and works similarly to how an aviation expert identifies anomalies from a narrative. After a lot of brainstorming, a set of shaper words that are useful in identifying anomaly expressions was created. These shaper words are used to identify expressions in a narrative. Regular expression templates are generated for particular types of anomalies using these expression rules. A narrative is run through these rule trees, and depending on the regular expression matches it is assigned anomalies.

3.1.2 Naïve Bayes Classifier
Naïve Bayes is a probabilistic classifier; an anomaly c* is assigned to a narrative document d where

    c* = argmax_c P(c | d).

We derive P(c | d) using Bayes rule as follows:

    P(c | d) = P(d | c) P(c) / P(d).

P(d) does not play any role in determining argmax_c P(c | d), so it is ignored. Naïve Bayes decomposes the term P(d | c) into a set of features f_i, assuming they are conditionally independent, so

    P(d | c) ≈ Π_i P(f_i | c).

In spite of this big assumption of conditional independence between words, which clearly does not hold for real world data, Naïve Bayes is known to be good for text classification [2] and is sometimes optimal even for highly dependent features [3].

3.1.3 Support Vector Machines
Support Vector Machines (SVMs) are known to be very good at classifying text [4]. The classification task is split up into multiple binary classification tasks. In the binary case, the basic idea is to represent the input features in a high dimensional feature space and then identify a hyperplane that separates the two classes by the largest possible margin. If we take

    w · x + b = 0,   where w = Σ_j α_j y_j x_j,

to be the hyperplane, then the x_j with nonzero α_j are the support vectors, and the α_j are obtained by solving the dual optimization problem [5].

In [6] an SVM classifier was used on the ASRS dataset by taking words as features. In order to control the number of features given to the learner, Information Gain (IG) was used as a metric for choosing the words.

IG is computed as

    IG = H(C) – H(C | W),

where H(C) is the entropy of the anomaly class and H(C | W) is the entropy of the anomaly class given the word. Overall it was observed that the top 2500 words are more than sufficient to get good results. It was also noticed that anomalies that occur less frequently result in poor F-scores and perform better with fewer words.
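As a rough illustration of this selection criterion, the Python sketch below computes IG for one word from presence/absence counts only. It is a minimal example under assumed data structures (token sets and binary labels), not the code used in [6].

```python
import math

def entropy(probs):
    """Entropy (in bits) of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, word):
    """IG = H(C) - H(C|W) for a single word, using only its presence or absence.

    docs   -- list of token sets, one per narrative (hypothetical layout)
    labels -- list of 0/1 anomaly labels, one per narrative
    """
    n = len(docs)
    pos = sum(labels)
    h_c = entropy([pos / n, (n - pos) / n])

    with_w = [labels[i] for i, d in enumerate(docs) if word in d]
    without_w = [labels[i] for i, d in enumerate(docs) if word not in d]

    h_c_given_w = 0.0
    for subset in (with_w, without_w):
        if subset:
            p_pos = sum(subset) / len(subset)
            h_c_given_w += (len(subset) / n) * entropy([p_pos, 1 - p_pos])
    return h_c - h_c_given_w
```

Ranking the vocabulary by this score and keeping the top few thousand words reproduces the kind of selection described above.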

3.2 Semi-Supervised Approaches

Semi-supervised approaches use a combination of labeled and unlabeled data to label the unlabeled documents. Very few semi-supervised algorithms have been applied to the ASRS dataset. The approach mentioned below is good for text classification in general but has never been used on the ASRS dataset.

3.2.1 Naïve Bayes EM
Expectation Maximization (EM) is a meta-learning algorithm which combines both the labeled and the unlabeled data. The algorithm works in two stages: in the first stage the unlabeled data is probabilistically labeled using the current parameter values, and in the second stage the parameter values are re-estimated using maximum likelihood estimation. In [7] Naïve Bayes is combined with EM to classify text.

To classify (E-step):

    P(c_j | d_i) = P(c_j) Π_k P(w_{d_i,k} | c_j) / Σ_r P(c_r) Π_k P(w_{d_i,k} | c_r),

where P(c_j | d_i) is the probability that document d_i belongs to class j, P(c_j) is the prior probability of class j, and P(w_{d_i,k} | c_j) is the probability of the kth word of document d_i occurring in class j.

To re-estimate the parameters (M-step), the function below is used to distinguish between the labeled and unlabeled documents:

    Λ(d_i) = λ if d_i ∈ D_u,   Λ(d_i) = 1 if d_i ∈ D_l,

where λ is the weight factor for unlabeled documents, D_l are the labeled documents and D_u are the unlabeled documents. The probability of a word occurring in a class is estimated as

    P(w_s | c_j) = (1 + Σ_i Λ(d_i) N(w_s, d_i) P(c_j | d_i)) / (|V| + Σ_t Σ_i Λ(d_i) N(w_t, d_i) P(c_j | d_i)),

where N(w_s, d_i) is the frequency of word w_s in document d_i and |V| is the vocabulary size. The class prior probabilities are computed using

    P(c_j) = (1 + Σ_i Λ(d_i) P(c_j | d_i)) / (|C| + |D_l| + λ |D_u|),

where |C| is the number of classes. Using the above set of expressions, good classification accuracy was achieved in their experiments. The best results were obtained when the value of λ is between 0 and 1; the appropriate value of λ depends on the type of data being classified.
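To make the interplay of the two steps concrete, here is a minimal NumPy sketch of a weighted Naïve Bayes EM loop in the spirit of [7]. The array layout, add-one smoothing, fixed iteration count and function name are assumptions for illustration only, not the procedure evaluated later in this thesis.

```python
import numpy as np

def nb_em(X_l, y_l, X_u, lam=0.25, n_iter=50):
    """Semi-supervised Naive Bayes with EM.

    X_l -- labeled word-count matrix (n_l x |V|)
    y_l -- labels in {0, 1} for the labeled documents
    X_u -- unlabeled word-count matrix (n_u x |V|)
    lam -- weight given to the unlabeled documents (the lambda factor above)
    """
    X = np.vstack([X_l, X_u])
    weights = np.concatenate([np.ones(len(X_l)), lam * np.ones(len(X_u))])
    n_classes, V = 2, X.shape[1]

    # Initialize responsibilities: labeled rows from their labels, unlabeled rows uniform.
    resp = np.full((len(X), n_classes), 0.5)
    resp[: len(X_l)] = 0.0
    resp[np.arange(len(X_l)), y_l] = 1.0

    for _ in range(n_iter):
        # M-step: re-estimate priors and word probabilities with add-one smoothing,
        # down-weighting the unlabeled documents by lam.
        w_resp = resp * weights[:, None]
        priors = (1.0 + w_resp.sum(axis=0)) / (n_classes + weights.sum())
        word_counts = X.T @ w_resp                      # |V| x n_classes
        word_probs = (1.0 + word_counts) / (V + word_counts.sum(axis=0))

        # E-step: probabilistically relabel every document.
        log_like = X @ np.log(word_probs) + np.log(priors)
        log_like -= log_like.max(axis=1, keepdims=True)
        resp = np.exp(log_like)
        resp /= resp.sum(axis=1, keepdims=True)
        # Labeled documents keep their known labels.
        resp[: len(X_l)] = 0.0
        resp[np.arange(len(X_l)), y_l] = 1.0

    return resp[len(X_l):]  # class probabilities for the unlabeled documents
```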

CHAPTER 4 OUR SEMI-SUPERVISED APPROACH

4.1 Introduction

A good look into the narratives of the ASRS dataset reveals a lot of connections between a few keywords and the anomalies associated with them. There are many words that will not occur with certain anomalies, and there are many words that will occur only with certain anomalies. But this is not an easily exploitable feature, since a document can have multiple anomalies. These are the two key concepts on which the approach relies for classification. To start with, the task of anomaly detection is split into binary classification tasks of checking for the presence of a single anomaly. For each of the binary tasks the following operations are performed:

• Feature Generation: Words are mapped directly to features.
• Word Selection: Relevant features are selected using the Information Gain metric.
• Feature Reduction: Multiple features are combined into one feature using Singular Value Decomposition (SVD) [8] and the Gram Schmidt process [9].
• Classification: A transductive SVM is used for learning and classifying the features.

Further sections detail each of the operations. We start by analyzing the data.



4.2 Data Analysis

Let's start by taking another look at the data. As mentioned before, the data has many characteristics, such as abbreviated words, misspellings, grammatical mistakes and domain specific words, that make it hard for people outside the aviation domain to understand, so there is no way we can readily use any of the more advanced tools of Natural Language Processing (NLP), like part of speech taggers, syntactic parsers or named entity recognizers. We are left with two alternatives: either we bite the bullet and take on the big task of fixing the text so that advanced NLP tools can be used, or we avoid using the tools and adopt a different approach to the problem, which is what most others seem to be doing.

WHEN I CALLED GND CTL I STATED; 'WORCESTER GND (OUR ACFT IDENT) TAXI; VFR TO BRADLEY.' SHE RESPONDED BY SAYING; '(ACFT IDENT); TAXI VIA BRAVO FOR INTXN DEP 29.' SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH. I RESPONDED BY SAYING; 'NEGATIVE.' SHE CAME BACK BY SAYING; 'BRAVO INTXN 29 FOR TKOF IS APPROVED.' I RESPONDED BY SAYING; 'CLRD TO GO AT 29?' SHE SAID; 'AFFIRMATIVE;' AFTER I SAID; 'CLRD TO GO 29.' WHEN I MADE MY 45 DEG LEFT TURN FOR DEP; I DECIDED TO CALL TWR FREQ TO ADVISE (WHICH IS NOT REQUIRED). WHEN I CALLED TWR FREQ; A MAN RESPONDED BY SAYING; '(ACFT IDENT); YOU WERE NEVER CLRD FOR TKOF.' THE NEXT DAY MY F/O (NAME) AND I VISITED THE WORCESTER TWR TO SPEAK TO THE CHIEF CTLR. WE LISTENED TO THE TAPE AND I WAS CORRECT; THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO 29. THE CHIEF CTLR SAID THERE WOULD BE NO ACTION TAKEN AGAINST ME AND MY F/O. THE CHIEF CTLR AND MYSELF AGREED THAT THE GND CTLR AND MYSELF USED POOR PHRASEOLOGY. FOR INSTANCE; 'CLRD TO GO;' WHICH IS USED QUITE OFTEN IN AVIATION FOR 'CLRD FOR TKOF.' I WOULD LIKE TO POINT OUT THAT NO ATTEMPT WAS MADE BY TWR OR GND FREQ TO CONTACT ME NOT TO TKOF AT BRAVO INTXN RWY 29. SUPPLEMENTAL INFORMATION FROM ACN 80338: CAPT FLYING ACFT AND OPERATING RADIOS. I ASKED CAPT IF SHE CLRD US FOR TKOF ON GND FREQ. HE SAID YES; AND THAT SHE WAS WORKING BOTH FREQS.

Figure 4.1. Narrative for Anomaly, Incursion – Runway

On closer examination of the narrative in Figure 4.1, it is easy to notice that many words only occur with certain anomalies. For example, in the above case the word TAXI is mentioned, which increases the chances of the anomaly being an incursion or excursion of the runway. Since it is easy for us to determine that by looking at the record, even machines can do so using simple measures like plain old information gain [6]. Another way of doing something similar would be to use inverse document frequency and add weight to words that are less frequent [10]. Now that we know what kind of words to look for, let us look at the way the words in the text are arranged. As we look at more and more documents, it starts to become clear that in most cases the order in which the words occur is not relevant. The other thing to notice is that shaper words like TAXI are not mentioned when the anomaly is, for instance, Weather (Figure 4.2). The more you look at the documents, the more a Bag of Words approach feels like the best way to move forward; that is what this thesis ended up doing.

AFTER 4 HRS FLT TIME ARRIVED AT SEA. WX HAD STARTED TO GO DOWN. ARPT USING ILS 34R APCH WITH A RVR OF 3000'. ATTEMPTED FIRST APCH AND INITIATED MISSED APCH AT DH WITH UNEVENTFUL GO AROUND. STARTED SECOND APCH; AT DH I SAW APCH LIGHTS. AT SAME TIME COPLT STARTED TO PUSH THROTTLE FORWARD FOR GO AROUND. I DID NOT REALIZE THAT THE TOGA BUTTONS HAD BEEN PUSHED. WHEN I GRABBED THE YOKE I DISCONNECTED THE AUTOPLT. AT 100' RWY CAME IN SIGHT AND WE WERE LINED UP A LITTLE RIGHT. AT THE SAME TIME I BECAME CONFUSED WITH THE THROTTLE RESPONSES BECAUSE I DID NOT REALIZE THE TOGA BUTTON HAD BEEN ACTIVATED. WE STARTED A LEFT DRIFT WHILE I TRIED TO TURN THE LIGHTS ON FOR BETTER DEFINITION OF THE RWY. AT ABOUT 50' THE DRIFT BECAME FASTER TO THE LEFT. WHEN EDGE LIGHTS CENTERED UP AND ALSO ILS TWR WAS SEEN; GO AROUND WAS INITIATED FOR AN UNEVENTFUL GO AROUND. ON GO AROUND IT WAS LEARNED THAT THE RVR HAD DROPPED TO 1000' ALL XMITTERS. LESSONS LEARNED: 1) A BETTER APCH BRIEF WITH COPLT KNOWING TO CALL DH OUT EXTRA LOUD SO IT IS DEFINITELY HEARD. 2) APCH SHOULD HAVE BEEN DISCONTINUED AS SOON AS THE AUTOTHROTTLES FELT FUNNY AND DEFINITELY WHEN THE RWY WAS NOT LINED UP. 3) ALSO BRIEF A POSITIVE GO AROUND CALLOUT AGAIN TO DISTRACT THE OUTSIDE CONCENTRATE. I THINK ANOTHER CONTRIBUTING FACTOR WAS THAT THE TIME WAS CLOSE TO 2 O'CLOCK BODY TIME AND CONCENTRATION WAS JUST A LITTLE SLOWER THAN NORMAL.

Figure 4.2. Narrative for Anomaly, Inflight Encounter – Weather

The other key thing to note is that many of the anomalies occur together; for example, anomalies 1 (Aircraft Equipment Malfunction) and 7 (Maintenance Problem) go hand in hand. This is a feature that is unique to this dataset and one which many datasets will not share; as a result, this thesis will not exploit this characteristic of the dataset. A Bayesian classifier could be used to predict, based on the narrative and the anomalies already identified, whether a different anomaly can also be associated with the current narrative. This could be something to try if the aim were just to classify this dataset.

4.3 Feature Generation

This section describes how the narratives were extracted from the ASRS dataset, the preprocessing and cleaning applied to them, and finally how the words from each narrative are converted into features.

4.3.1 Content Extraction
The ASRS document set used for this thesis was a set of comma separated values (csv) files covering the years 1988 to 2007. Each of these files has thousands of narratives and about 54 fields. The thesis uses only two of these fields, the narrative and the event anomaly. There are a total of more than 60 anomalies present in these files. In this stage we pick out all the narratives containing at least one of the anomalies mentioned in Table 2.1. This step returns about 75000 narratives.

4.3.2 Tokenization
This step involves converting the text as seen in Figure 4.1 into the form shown in Figure 4.3. Tokenizing the text is very important in order to ensure that the words can be identified easily, since there is a lot of punctuation attached to the words of the narrative. The first step in tokenization is to separate the punctuation from the words. However, there are a few places where punctuation indicates a measure of distance, for example 3000', which shows that the distance is 3000 feet. In these places we cannot afford to strip or separate the punctuation, because doing so can have an adverse effect on the matching process. To handle such cases, punctuation following or contained within numbers was not removed.

WHEN I CALLED GND CTL I STATED WORCESTER GND OUR ACFT IDENT TAXI VFR TO BRADLEY SHE RESPONDED BY SAYING ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP 29 SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH I RESPONDED BY SAYING NEGATIVE SHE CAME BACK BY SAYING BRAVO INTXN 29 FOR TKOF IS APPROVED I RESPONDED BY SAYING CLRD TO GO AT 29 SHE SAID AFFIRMATIVE …

Figure 4.3. Portion of a tokenized narrative for Anomaly, Incursion – Runway

4.3.3 Hash Generation
The basic idea behind hash generation is to replace words by numbers in order to facilitate the generation of the input files for the SVM learning algorithm. All the narratives are read, and each word is assigned a number that will be its primary identity from now on. We are also not interested in the order in which words occur; all we are interested in is the number of times they occur in a narrative. So, to avoid repeated parsing of the sentence to get the frequency of occurrence of the words, we take the primary key and associate with each word its normalized frequency. Specifically,

    Normalized Frequency = (total occurrences of the word) / (total number of words in the narrative).

Finally, the result is sorted in ascending order of the primary key value. To exemplify, Figure 4.4 is a hashed version of Figure 4.3.

8:0.00574712643678161 10:0.028735632183908 17:0.0862068965517241 23:0.0229885057471264 33:0.0114942528735632 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 81:0.0229885057471264 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 121:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 128:0.0065359477124183 150:0.0065359477124183 155:0.0065359477124183 …

Figure 4.4. Hashed version of tokenized narrative for Anomaly, Incursion – Runway
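As a rough illustration of the tokenization and hashing steps above, the Python sketch below builds the word-id map and the sorted (id, normalized frequency) pairs. The regular expressions and dictionary layout are illustrative assumptions, not the exact preprocessing code used for the experiments.

```python
import re
from collections import Counter

word_ids = {}  # global word -> integer id map, built as narratives are read

def tokenize(narrative):
    """Strip punctuation from words, but keep pieces that start with a digit (e.g. 3000')."""
    tokens = []
    for piece in narrative.split():
        if re.match(r"^\d", piece):
            tokens.append(piece)          # numbers keep any attached punctuation
        else:
            tokens.extend(re.findall(r"[A-Z0-9/]+", piece.upper()))  # rough approximation
    return tokens

def hash_narrative(narrative):
    """Return sorted (word id, normalized frequency) pairs for one narrative."""
    tokens = tokenize(narrative)
    counts = Counter(tokens)
    features = {}
    for word, count in counts.items():
        wid = word_ids.setdefault(word, len(word_ids) + 1)  # ids start at 1 for SVM-light
        features[wid] = count / len(tokens)
    return sorted(features.items())
```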

4.4 Word Selection

There are over 20,000 unique words in the ASRS dataset. Since we are considering words as features, it becomes a tall order for any classifier to classify well using such a high number of features. Even though the SVM learning algorithm is known to handle large feature sets very nicely, it is still better to reduce the number of features sent to it by looking at the kind of features the method provides. After experimenting with several methods on the dataset, we observed that the best results were obtained when only a selective subset of words from the narratives was used instead of all the words. Typically, the words that are selected either appear a lot in the narratives with a particular anomaly or in all but that anomaly. The Information Gain (IG) method for word selection used by the folks at NASA in [6] was found to be one of the best ways of doing word selection (see section 4.4.1). The key advantage of using IG is that it takes care of both of the cases mentioned above. Another method, using subtraction, was also added to the system to increase the accuracy of the words picked (see section 4.4.2).

4.4.1 Plain Information Gain
IG is calculated as

    IG = H(anomaly class) – H(anomaly class | feature).

In this method, normalized frequencies are not considered while calculating the entropy; all that is considered is the presence or absence of a feature in the document. Also, features that occur in only a single document are ignored for being too rare.

4.4.2 Simple Subtraction
One of the easiest ways to find out whether a feature is more prevalent in one class than in the other classes is by doing a subtraction. The problem with subtraction is that the range of values it returns can be huge. To fix this we use the log function, which brings the numbers returned by the subtraction back to a manageable level. So, basically, what is used is the log of the difference between the frequencies of occurrence of a word inside the class and outside it.

4.4.3 Combined Metric
A combination of both word selection methods described above is used to eliminate the irrelevant features for an anomaly. The values of both IG and subtraction are calculated for all the features. Then two lists are formed, one with the features arranged in descending order of IG and the other in descending order of the subtraction score. 75% of the required features are chosen from the top of the IG list, and the remaining 25% are taken from the top of the subtraction list. After this, all unselected features are considered irrelevant and removed from both the training and test files. A result of word selection is shown in Figure 4.5.

17:0.0862068965517241 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 155:0.0065359477124183 …

Figure 4.5. Reduced hashed narrative for Anomaly, Incursion – Runway
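A small sketch of the 75/25 combination described above, assuming the per-word IG and log-subtraction scores have already been computed (for example along the lines of the IG sketch in Chapter 3). The function names and data layout are illustrative.

```python
def select_words(ig_scores, sub_scores, n_features):
    """Pick 75% of the words by Information Gain and the remaining 25% by the
    log-subtraction score, skipping words already chosen from the IG list.

    ig_scores, sub_scores -- dicts mapping word id -> score
    """
    by_ig = sorted(ig_scores, key=ig_scores.get, reverse=True)
    by_sub = sorted(sub_scores, key=sub_scores.get, reverse=True)

    n_ig = int(0.75 * n_features)
    selected = list(by_ig[:n_ig])
    chosen = set(selected)
    for word in by_sub:
        if len(selected) >= n_features:
            break
        if word not in chosen:
            selected.append(word)
            chosen.add(word)
    return set(selected)

def filter_features(narrative_features, selected):
    """Drop every feature not in the selected word set (applied to train and test files)."""
    return [(wid, value) for wid, value in narrative_features if wid in selected]
```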

4.5 Feature Reduction

In many classification tasks, the feature values for an instance are typically generated from the instance itself. A less common approach is to use multiple instances to generate the features for a given instance. One such approach uses latent semantic kernels [11]. Our feature reduction approach is strongly influenced by latent semantics. This section starts with the basic terminology and then continues on to explain how and why using this method for feature reduction is a good idea.

4.5.1 Term By Document Vector
A document can be represented in many different ways. One of the more mathematical ways of representing it is similar to a histogram: take all the words in alphabetical order and associate with each word the frequency of its occurrence in the document. For example, the document term vector for the sentence given below is shown in Table 4.1.

THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO

Table 4.1. Term by document vector

acknowledged   1
at             1
bravo          1
clrd           1
ctlr           1
gnd            1
go             1
i              1
intxn          1
that           2
the            1
to             1
was            1

In a conventional term by document vector, the entries correspond to word stems instead of the actual words. In our case the words themselves were used, due to the presence of a large number of acronyms and misspellings. The term by document vector in this case has already been generated based on the results from section 4.4. The only key difference is that instead of having words sorted in alphabetical order, there are numbers, which represent the words, sorted in ascending order. A document by term vector also does not normally contain normalized frequencies in its representation; this does not harm the approach in any way, since the numbers are still as far apart as before, just on a reduced scale.

4.5.2 Inner Product
The inner product (or dot product) is an operation that takes two vectors and returns a single number as output. To get the scalar or dot product of two vectors, each element of the first vector is multiplied by the corresponding element of the other vector, and the sum of the results is the dot product.

The easiest way to calculate the similarity between two documents is to take the dot product of the document by term vectors of the two documents. So, for example, take the following sentences.

S1. THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
S2. CLRD TO GO AT INTXN BRAVO
S3. ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP

Table 4.2. Term by document vectors for S1, S2 and S3

Sentence 1            Sentence 2         Sentence 3
acknowledged   1      at       1         acft    1
at             1      bravo    1         bravo   1
bravo          1      clrd     1         dep     1
clrd           1      go       1         for     1
ctlr           1      intxn    1         ident   1
gnd            1      to       1         intxn   1
go             1                         taxi    1
i              1                         via     1
intxn          1
that           2
the            1
to             1
was            1

Taking the dot product between each pair of document by term vectors, we have:

    S1 . S2 = 1 + 1 + 1 + 1 + 1 + 1 = 6
    S1 . S3 = 1 + 1 = 2
    S2 . S3 = 1 + 1 = 2

Looking at the results of the dot product, it is easy to say that S1 and S2 are more similar than S1, S3 or S2, S3. The same result will hold if frequencies are replaced with normalized frequencies: instead of the results being 6, 2 and 2, they would be fractions with the same ordering.

4.5.3 Singular Value Decomposition
Singular value decomposition (SVD) is the process of factorizing an n×m matrix M into matrices of the form M = UΣV*, where U is a unitary matrix over the feature space of real numbers, Σ contains the singular values, which can be used to transform other vectors, V is a set of orthonormal directions of M, and V* is the conjugate transpose of V. The training set is used to generate Σ, which is then used to generate the features for both the training and test sets.

4.5.4 Gram Schmidt Process
The Gram Schmidt process converts a linearly independent set of vectors v1, v2, v3, …, vn into an orthogonal set of vectors q1, q2, q3, …, qn in an inner product space (commonly Euclidean space). The process iterates over the equations

    q1 = v1,   qk = vk − Σ_{j=1}^{k−1} proj_{qj}(vk),

where proj_q(v) is

    proj_q(v) = (⟨v, q⟩ / ⟨q, q⟩) q.

Due to floating point issues, the direct algorithm described above doesn't give exact results on a computer. There is a stable version of the algorithm (see footnote 1) that is used instead, given below.

GRAM-SCHMIDT-STABLE
  for i = 1 to n do
    v_i = a_i
  end for
  for i = 1 to n do
    r_ii = ||v_i||
    q_i = v_i / r_ii
    for j = i + 1 to n do
      r_ij = q_i^T v_j
      v_j = v_j - r_ij q_i
    end for
  end for
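For reference, the same stable (modified) Gram Schmidt procedure written as a small NumPy sketch; it mirrors the pseudocode above and is only an illustration, not the implementation used in the experiments.

```python
import numpy as np

def modified_gram_schmidt(A):
    """Stable Gram-Schmidt: orthonormalize the columns of A.

    Returns Q with orthonormal columns and upper-triangular R such that A = Q R.
    Assumes the columns of A are linearly independent (no zero norms).
    """
    A = np.array(A, dtype=float)
    n = A.shape[1]
    V = A.copy()                      # v_i = a_i
    Q = np.zeros_like(V)
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ V[:, j]
            V[:, j] = V[:, j] - R[i, j] * Q[:, i]
    return Q, R
```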

4.5.5 Applying Gram Schmidt
The concept of latent semantics relies on co-occurrence: since we are not using any external semantic tool like WordNet or a semantic net, we can use the fact that two words occurring together in a document are related to each other. This fact can be exploited by generating the eigenvalues using SVD [11]. Estimating the eigenvalues of a term document matrix by matrix factorization is a very computationally intensive task. As a result, we use Gram Schmidt orthogonalization (GS) to estimate the singular values and use them to reduce the dimensions of the space effectively. The GS method is known to provide results similar to the latent semantics kernel (LSK) [12]. All the feature values of each document are normalized, for all documents. The document term matrix is created and given to the GS algorithm for processing.

1. http://www.phys.ocean.dal.ca/~lukeman/teaching/3170-3111/slides/Lecture8.pdf

To generate the training data:

• The 2-norm is taken for all the document vectors.
• A biasing factor is applied to the norm of all the training documents containing the anomaly.
• The K documents (where K is the resulting feature space size) with the highest biased norm scores are taken as the primary vectors for calculating the singular value features.
• The GS algorithm is run on all the documents using these K documents. The reduced features obtained are used for training.

To generate the test data, the GS algorithm is run with the K documents selected in the training stage to generate features for the test documents. The training and test files are then generated using the features computed in the training and test stages, as described above. Even though there exists an unsupervised way of selecting the K documents [13], the unsupervised approach was seen not to give results nearly as good as the supervised method [12].
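The biasing and reference-document selection are only loosely specified above, so the sketch below is one plausible reading of the procedure: rank training documents by their (biased) 2-norm, keep the top K as reference vectors, orthonormalize them with the modified Gram Schmidt routine sketched in section 4.5.4, and use each document's projections onto that basis as its K reduced features. The parameter names and the projection step are assumptions for illustration only.

```python
import numpy as np

def gs_features(train_docs, train_labels, test_docs, k, bias=3.0):
    """Reduce term vectors to k features using reference documents and Gram-Schmidt.

    train_docs, test_docs -- document-by-term matrices (rows are normalized term vectors)
    train_labels          -- 1 if the anomaly is present in the training document, else 0
    bias                  -- factor applied to the norm of anomaly-bearing documents
    """
    train_docs = np.asarray(train_docs, dtype=float)
    test_docs = np.asarray(test_docs, dtype=float)

    norms = np.linalg.norm(train_docs, axis=1)
    scores = np.where(np.asarray(train_labels) == 1, bias * norms, norms)
    ref_idx = np.argsort(scores)[::-1][:k]           # top-k biased-norm documents

    # Orthonormalize the k reference documents (as columns) with modified Gram-Schmidt.
    Q, _ = modified_gram_schmidt(train_docs[ref_idx].T)

    # Each document's reduced features are its projections onto the reference basis.
    train_features = train_docs @ Q
    test_features = test_docs @ Q
    return train_features, test_features
```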

4.6 Classification

SVM-light (see footnote 2) is a C language based implementation of SVM. It is small in size, fast to run and very easy to use. These inherent advantages of SVM-light made us use it. It comes with two main executables: svm_learn and svm_classify. The learner svm_learn takes the training data and generates a model. The classifier svm_classify takes the test data along with the model and generates the classifications.

2. http://svmlight.joachims.org/

4.6.1 Learning
Training an SVM with transduction is quite simple. Once we have all the features generated using the methods from sections 4.2 to 4.5, we only need to put the features in a format that the SVM-light learner can understand. The input format for the training file is a class label (either -1 or 1) followed by a space separated list of feature number and value pairs. In our case, if the document belongs to that anomaly we assign it -1, otherwise we assign it 1. In order to use transduction, the unlabeled instances also need to be passed; these unlabeled instances are assigned a class value of 0. See Figure 4.6 for an example of a training file. Once the learner encounters a line with label zero, it changes to the transduction mode of operation.

-1 101:0.23 132:0.39 159:1.23 160: 2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111: 0.102 116:0.95 130: 0.234
1 101:0.123 102:0.69 105:0.823 116: 0.92 121:0.45
-1 104:1.025 178:7.34 180:1.4

Figure 4.6. Sample Training File content

Finally, svm_learn is run to start the learner, given the training file and model file name as parameters, as shown below.

svm_learn train_file model_output_file

4.6.2 Classifying
The test file is created the same way as the training file. All the narratives that are not labeled are put in a file. The features required for the classification of a test document are generated using the same set of reference narratives as in section 4.6.1. Each test document is processed through the steps mentioned in sections 4.2 to 4.5. After going through these stages, the features generated for each document are taken, and a zero is put in place of the class value when passing the features for classification through the SVM. A sample test file is shown in Figure 4.7.

0 101:0.23 132:0.39 159:1.23 160: 2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111: 0.102 116:0.95 150:2.345
0 101:0.123 102:0.69 105:0.823 116: 0.92 121:0.45

Figure 4.7. Sample Test File content

To use the classifier, the executable svm_classify has to be run with the test file, the model file and the output file name given as command line arguments, as shown below.

svm_classify unlabeled_documents_file model_file classification_output_file

4.6.3 Interpreting Output
SVM returns a set of numbers around zero, one per line; each line corresponds to one document instance. If the number is closer to -1 it basically means that the SVM classifies the instance as negative, and if it is close to 1 it means it is positive. The number the SVM churns out indicates the degree of confidence it has in its decision. A threshold of 0 was used for determining the class: anything above zero was considered positive and everything below or equal to zero negative.
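A minimal sketch of how the SVM-light input files described in this section might be produced. svm_learn and svm_classify are the actual SVM-light executables, but the helper function and file names below are illustrative assumptions, not the code used for the experiments.

```python
def write_svmlight_file(path, instances):
    """Write instances as SVM-light lines: '<label> <id>:<value> <id>:<value> ...'.

    instances -- list of (label, features) where label is -1 or 1 (labeled) or 0
                 (unlabeled, which triggers transduction in svm_learn), and features
                 is a sorted list of (feature id, value) pairs.
    """
    with open(path, "w") as out:
        for label, features in instances:
            pairs = " ".join(f"{fid}:{value}" for fid, value in features)
            out.write(f"{label} {pairs}\n")

# The training file mixes labeled (-1 / 1) and unlabeled (0) narratives; the test file
# uses 0 as a placeholder label for every narrative, e.g.:
#   write_svmlight_file("train_file", labeled_instances + unlabeled_instances)
#   write_svmlight_file("unlabeled_documents_file", test_instances)
# and then, from the shell:
#   svm_learn train_file model_output_file
#   svm_classify unlabeled_documents_file model_file classification_output_file
```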

CHAPTER 5 EVALUATION

5.1 Dataset

As mentioned before, the evaluation dataset contains the aviation safety reports, with over 62 anomalies and over 130,000 narratives. In order to reduce the scale of the task, only 8 anomalies were considered in our experiments. The number of relevant narratives (i.e., the number of narratives containing at least one of the eight anomalies) was over 75000. In order to make sure that the program can run on everyday hardware, only 8000 narratives were used. By using fewer narratives, the proportion of low frequency anomalies in the dataset was increased. Since a single narrative can have more than one anomaly, the total need not be the same as the total number of narratives. The table below lists the anomalies and the number of narratives associated with them. The evaluation was performed using 2-fold cross validation. This was done in order to have a smaller percentage of training instances, which adds greater value while developing a semi-supervised approach. Each fold contains 4000 narratives, and the details of the distribution of classes over the 2 folds are given in the columns of the table.


Table 5.1. Dataset Compositions

Class  Anomaly Description                          Fold 1   Fold 2   Total   Participation (%)
1      Aircraft Equipment Problem – Critical          1028     1033    2061    16.78
2      Airspace Violation – Entry                      513      537    1050     8.55
3      Incursion – Runway                              541      573    1114     9.07
4      Inflight Encounter – Turbulence                 523      587    1010     8.22
5      Inflight Encounter – Weather                   1060     1055    2115    17.22
6      Inflight Encounter – VFR in IMC                 505      521    1026     8.36
7      Maintenance Problem – Improper maintenance      510      496    1006     8.19
8      Non Adherence – Clearance                      1455     1443    2898    23.60

5.2 Experiment Setup

Identification is done using binary SVM classifiers. In this case we use 8 binary classifiers, set up to handle each anomaly independently of the others. The experiments and results are presented starting with the existing methods, followed by our proposed approach.

5.3 Baseline Experiments

We used three different approaches as baseline systems:

• The NASA approach with the SVM.
• The Naïve Bayes EM.
• The Gram Schmidt Kernel based SVM.

5.3.1 Plain SVM
The most basic of the methods that can be readily used on the dataset is classification using a linear SVM. In order to do this, we take the output generated by the method described in section 4.3.3 and pass it as input to the learning algorithm. The SVM-light package, which is based on [7], was used to classify the data. For the classification stage all the options were set to their default values. Words were selected using the IG metric mentioned in section 4.4.1. The SVM-light system accepts input as line separated instances. Each instance is given in the form of a class label (-1 or 1) followed by space separated feature number, colon, feature value entries. The feature numbers must be strictly greater than zero. The learner learns and generates a model file after processing the labeled instances. The test instances are also passed to the SVM-light system in the same way as described above, but instead of the class label at the beginning of the line, some placeholder is given. The output provided by the system is a number in the range of -1 to 1, with numbers closer to -1 meaning a classification of -1 and ones tending towards 1 meaning a classification of 1. A threshold of 0 was used to discretize the output. The results of the experiment are tabulated in Table 5.2 and graphed in Figure 5.1. The rows of Table 5.2 correspond to the number of features selected by IG and used by the SVM; the columns are the eight anomalies.

Table 5.2. F-Scores of Plain SVM

No. of Words   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8
500              83.21     62.13     85.64     50.19     65.38     77.32     71.80     78.42
1500             83.50     63.02     85.27     49.28     65.63     77.54     72.30     79.12
2500             83.65     62.99     85.49     49.21     65.48     77.85     71.87     78.95
5000             83.61     62.79     85.44     49.34     65.33     78.12     72.48     79.07
All              83.92     61.40     85.50     49.06     65.34     77.65     72.57     79.13


Figure 5.1. Results of Plain SVM

As seen from the results, the approach of just selecting the words using IG works very well with this dataset, with the overall F-score reaching 73.81. The key reason it fails to perform well in the case of anomaly 4 is that this anomaly (Inflight Encounter – Turbulence) is almost always accompanied by anomaly 5 (Inflight Encounter – Weather), so the two have many words in common and are not easily distinguished from each other. In the case of anomaly 2 (Airspace Violation – Entry), the low performance can be attributed to the fact that not many good shaper words are found, due to the low number of training instances belonging to this anomaly class. Overall, anomalies with a higher presence were detected well, except for anomaly 5 due to a large overlap with anomaly 4.

5.3.2 Naïve Bayes EM
The Naïve Bayes classifier is known to provide good results on text classification tasks. In this experiment the Naïve Bayes classifier is used in conjunction with EM as given in [7]. The details of the method can be found in section 3.2.1. Unlike the previous methods,

where a standard learning algorithm package was used, here a custom Naïve Bayes EM algorithm was coded. Doing this allowed greater flexibility in how the input was provided and easier tuning of the parameters. The unlabeled data that we use is the test set. The input given to the system is again taken from the output of sections 4.3.3 and 4.4. Since the EM algorithm works differently from the SVM, a few changes needed to be made in the way the input was passed and handled. In order to keep the input format the same as in the other methods, the missing features (the words that do not occur in the narrative) are handled in the code by setting the value of those features to zero. Furthermore, instead of using the raw frequency of occurrence of words in the instances, as given in the method of section 3.2.1, the normalized frequency is used. The output generated by the method is not the same as that generated by the SVM classifiers; it is the probability of occurrence of the anomaly, which is then converted to either -1 or 1 depending on the probability value. The importance given to the unlabeled data was varied, and the best scores obtained by varying this importance are listed in Table 5.3. The average number of EM iterations was 50.

Table 5.3. F-Scores of Naïve Bayes EM

Unlabeled data weight   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8
0.1                       83.69     58.71     86.55     59.53     69.90     69.84     72.73     78.57
0.25                      83.57     56.46     86.74     57.89     68.14     67.18     72.38     78.47
0.5                       83.35     54.16     86.83     55.75     66.31     64.98     71.68     78.06
0.7                       83.26     52.43     86.68     54.30     64.93     63.92     71.44     77.81
0.9                       83.26     51.59     86.76     52.80     63.91     63.47     71.23     77.27


Figure 5.2. Results of Naïve Bayes EM

The overall F-score obtained over all the anomalies was 74.21. All words are used, so there are more word frequencies that can influence the outcome; as a result it performs better on both anomalies 4 and 5. It performs poorly on anomaly 2 because the narratives do not contain many words that repeat with high frequency across different documents of the same anomaly. The performance on the other anomalies is similar to the plain SVM approach. So the use of unlabeled data improves the performance of the Naïve Bayes classifier, but it is still only comparable to a supervised SVM.

5.3.3 Gram Schmidt Kernel
In this baseline, we generate the features using the Gram Schmidt Kernel (GSK) algorithm mentioned in [12]. The entire original preprocessed dataset was used to generate the features using the GSK algorithm. The numbers of features chosen were 100, 500 and 1000, and the biases used were 1, 2 and 3.

Once the features were generated, the labeled feature instances were given as input to the SVM with all default options for training. -1 was used if the anomaly was present and 1 was used if the anomaly was not present. Then the test file was generated using GSK via the exact same procedure as in training and passed as input to the corresponding model. The result was determined using a threshold of 0: instances with scores below 0 were classified as -1 and scores above 0 were classified as 1. The best results obtained are tabulated in Table 5.4 and graphed in Figure 5.3. The overall F-score obtained was 72.77. The GSK approach reduces the features in the feature set similarly to the word selection approach, by combining the features that are most relevant; hence both methods perform similarly.

Table 5.4. Results of GSK

Class   Precision   Recall   F-Score
1           86.37    79.62     82.56
2           90.67    46.36     61.29
3           88.35    81.76     84.90
4           79.91    34.30     47.98
5           77.56    55.03     64.38
6           87.43    68.49     76.68
7           81.79    64.88     72.36
8           80.73    76.99     78.81


Figure 5.3. Results of GSK

5.4 Our Semi-supervised Approach

The semi-supervised approach has two stages. The first stage is based entirely on word selection using IG and the subtractive log method mentioned in section 4.4. The second stage further reduces the dimensionality of the feature set obtained in the first stage by applying our instance selection based method to generate the features.

5.4.1 Stage 1: Our Word Selection
An SVM with transduction can incorporate both labeled and unlabeled information to potentially yield higher classification accuracies. As in the case of the plain SVM, the SVM-light package was used with default options. In order to run SVM-light in transduction mode, a few extra things need to be done. The input is generated in a similar manner to the previous method: the output generated in section 4.3.3 is used with a few additions, since we need to combine the labeled and unlabeled instances for processing through the learning algorithm. The way the class labels are passed is different: for the instances that are labeled, a -1 or 1 is passed, and 0 is passed for the unlabeled instances. In all the experiments, all the test instances were passed as the unlabeled instances for training the learner. The learner learns from the data given to it and generates the model, which is used for classification as usual. Unlabeled instances are passed in the same way as they were passed in the previous method. Again the classifier returns a confidence score between -1 and 1, and a threshold of 0 is used to classify an instance as belonging to either -1 or 1. The experiment was performed with the application of word selection as given in section 4.4. The results of the experiment for 500, 1500, 2500 and 5000 words are tabulated in Table 5.5 and graphed in Figure 5.4. Comparing this table with those for the baseline systems, we can see that combining labeled and unlabeled data gives better results than having just labeled data for training. When looking at the results, one of the key things that stands out is the classification result obtained for anomaly 4. In the preceding experiments, word selection was done using the frequency of occurrence of the words in the documents. However, in this experiment we use only binary valued features that indicate the presence or absence of a word in the document, instead of the number of times it occurs, which was the case with the plain SVM approach. The slight increase seen in all the other anomalies is due partly to the presence/absence modification and partly to the fact that both the labeled and unlabeled data were used to learn the features by the transductive SVM classifier.

Table 5.5. Results of Stage 1

No. of Words   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8
500              84.16     72.64     85.99     69.53     72.48     81.30     76.73     79.70
1500             84.21     73.23     86.36     69.71     72.00     81.94     76.54     79.04
2500             84.61     73.57     86.79     69.68     72.23     82.22     76.53     79.30
5000             84.29     73.88     86.24     69.94     71.96     82.81     76.13     78.91

Figure 5.4. Results of Stage 1

5.4.2 Stage 2: Further Dimensionality Reduction
We continue from the reduced feature set produced by our word selection method. One interesting observation that can be incorporated into this reduction method is that using word presence instead of the frequency of occurrence of words in a given document does not hurt the overall classification; for anomalies with low classification rates, like anomaly 4, word presence was very beneficial. Since we cannot simply replace the normalized frequencies with 1 (or some fraction) without influencing the value decomposition, we use the method of taking the 1-norm of the constant based feature set generated in the previous step. 5000 features were selected in the word selection phase, so that none of the documents are left with a very small number of features. The train and test feature reduction algorithms are run on this 5000 feature set. Then the transductive SVM with default parameters is used to classify the data. The results tabulated in Table 5.6 and graphed in Figure 5.5 were obtained using reduced feature counts of 250, 500 and 800 with a bias value of 3. The overall F-score obtained was 78.62, compared to the overall F-score of 78.64 obtained from stage 1. The F-scores do not change much with the application of the feature reduction using GS, but for classes with average participation the reduction gives better results compared to word selection. However, it performs badly in cases where there is a good amount of overlap between classes, as with anomalies 4 and 5.

Table 5.6. Results of Stage 2

No. of Features   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8
250                 82.93     70.51     85.85     66.73     70.57     81.13     76.46     77.96
500                 84.41     73.51     86.17     68.31     72.32     81.87     77.26     79.09
800                 84.77     74.01     86.77     69.05     73.03     82.41     77.22     79.33


Figure 5.5. Results of Stage 2

5.5 Result Comparison And Error Analysis

Looking at the results for the various approaches, it is very clear that the word selection based approaches perform very well on the ASRS dataset. The performance goes down when the words are spread over many documents with similar frequencies in different documents. But overall, in most cases the classification performance is good, considering that all approaches are bag of words based. The best approach turned out to be our approach of word selection using a combination method instead of plain IG, and using the entire dataset for training the transductive SVM classifier. The GS kernel based baseline method of determining features to improve classification performance gave better results than the other existing methods, but was not as good as the word presence method used in combination with a transductive SVM with a linear kernel.

The reason our GS-based approach does not perform as well is perhaps that the reference points used to construct the new feature space are not determined by a sufficiently good method. Switching to simulated annealing or to a transductive approach for selecting the reference documents could boost the results.

Table 5.9. Comparing the F-Scores

Method     1        2        3        4        5        6        7        8        Overall
NASA       83.92    63.02    85.64    50.19    65.63    78.12    72.57    79.13    73.82
EM         83.69    58.71    86.83    59.53    69.90    69.84    72.73    78.57    74.21
GSK        82.56    61.29    84.90    47.98    64.38    76.68    72.36    78.81    72.77
Stage 1    84.61    73.88    86.79    69.94    72.48    82.81    76.73    79.70    78.64
Stage 2    84.77    74.01    86.77    69.05    73.03    82.41    77.22    79.33    78.62

Figure 5.6. Comparing all results

CHAPTER 6 CONCLUSION

Few methods use the entire dataset to generate features for a single document in that dataset. This thesis proposes a method that lets us combine aspects of different documents to form the features, resulting in good scores. The ASRS dataset is very word-centric, and this is clearly shown by the experiments conducted using different ways of selecting words. We could obtain high F-scores just by using simple word selection methods and passing the resulting features through a learning algorithm such as an SVM. Combining unlabeled and labeled data using transduction gave good improvements over all the existing approaches, but it could perform even better if both the labeled and unlabeled documents were used together to select words. The semi-supervised approach developed using the concepts of word selection and GSK worked far better on this dataset than GSK alone. The key to our method is the way instances are selected and used to generate features for other documents. The main contribution of this thesis lies in our view that documents should not be used just as input data but also as reference points from which patterns can be derived for future classification.

There is a lot of room for improvement, especially with respect to how the original instances are selected for use during feature generation. The current method of using the training data, though effective, could be improved by using both the training and test data. Also, the broader the spread of the reference points, the better the results should be. The word selection metric can also be improved to provide better features to the GSK module, so that more features are present per document without bringing in contradictory features.

At first glance, it appears that accurately determining the anomalies associated with the reports would be an extremely complicated task, requiring many stages of processing. However, the results of our experiments and the details of the new approach suggest that sometimes it is good to think simple. At least for the ASRS dataset, it turns out that the simpler the approach, the better the results.


VITA

Suraj Raghuraman was born in Madras (Chennai), India on 25 June 1984, the son of Raghuraman and Usha Raghuraman. Suraj spent most of his childhood breaking and fixing things around the house, later shifting to doing the same on the computer and saving his parents a lot of money. He did his schooling all over India, switching schools and cities very often. He joined MVIT, Bangalore for a Bachelor of Engineering in Information Science and, after a very patient wait of four years, was awarded the degree in 2005. He joined Subex as a Fraud & Risk Management engineer during the final years of his Bachelor's. Suraj was very fond of starting companies; during the later years of his Bachelor's and during his work at Subex, he started a few companies and later closed them down, all of them profitable. During 2007 he was getting bored with his job at Subex, and one of his family's ventures finally became big enough to sustain itself. He decided to leave his country, quit his job, and joined UTD for a Master's in Computer Science in Fall 2007. Currently he works at the HLTRI at UTD.
