Do-It-Yourself Data Mining
A Three-Part White Paper

By Sherwin A. Steffin
May 15, 2009

Introduction

Back in the days when I was founder and CEO of a Macintosh software publishing company (BrainPower, Inc.), I designed a powerful text analysis product for the Macintosh platform, named ArchiText. An independent contractor was employed for program coding. When the company closed in 1990, the program disappeared from public view, but to this day it contains features that I have found in no other free or low-priced program. Unfortunately, the program operates only on obsolete Mac computers running OS 6.1-9.x, a rapidly declining number of machines. Additionally, it was developed long before the Internet, so it cannot directly import web pages, but must rely on plain text, or MS Word 4.0 or earlier.

Given those limitations, I am looking for a programmer interested in updating the software design elements into a new program, usable on all platforms, and incorporating features not available at that time, or whose development costs prevented their inclusion. I see this as a project which would be undertaken without any financial transactions between the parties involved in development. The program could be sold as low-cost (under $20.00) shareware, providing revenue to all involved in its development. Anyone interested in reviewing the program who has access to a Macintosh with the requisite OS can receive a copy by contacting the author at the email address provided in my profile. Please put “ARCHITEXT” in the subject line.

A Rationale for Reconstruction of a Twenty-Year-Old Program

The first thing anyone considering becoming involved in this project will ask is, “Why should any effort go into rebuilding a twenty-year-old program?” As you read through the features and capabilities of this program, you will find several components missing from every modern search engine, as elegant and innovative as each is:

1. A search operator that facilitates proximity searches. Wikipedia provides a useful definition of the proximity search. As you read that page, you will note that while Google provides some facility for this feature, it is more than a little clumsy to implement. Other attempts to implement this feature are similarly clumsy or very limited; one example is an API designed to do only two-word proximity searches.

2. A SEARCH WITHIN SENTENCE or SEARCH WITHIN PARAGRAPH operator.


Perhaps there is no greater frustration than searching multiple terms using the AND operator and finding thousands of pages containing both terms, yet totally unrelated to each other. Depending on the search terms employed, one can get very targeted results, or be greatly frustrated by pages where all terms are present but unrelated. Suppose you are searching for instances where George H.W. Bush and CIA drug-running pilot Barry Seal appear and are identified as having a relationship to each other. You do a quick search, Bush AND Seal. Google reveals a page count of 1,520,000 pages! Treating both terms as quote-delimited phrases, “George Bush” AND “Barry Seal,” still yields a count of 967 pages. Even after scanning link titles, a substantial number of pages must be opened and scanned to determine whether there is a specific commonality between these individuals. If, however, the two terms can be linked as occurring within the same sentence or paragraph, commonalities can be rapidly identified.

3. Extraction and combination of all sentences or paragraphs having search terms in common. In your investigation of the JFK assassination, you have assembled many books and articles which you have converted to text. You want to see how much agreement exists among authors regarding the frame-by-frame analysis of the Zapruder film. You search for “Frame nnn.” Instead of copying and pasting each paragraph in which the frame is mentioned, you create one document containing all paragraphs in which this frame is mentioned, along with the source from which each was extracted.

The following material is a three-part white paper, introducing the reader to three approaches to Data Mining, requiring little or no previous training and using low-cost or free computer applications. In Part I, the reader is introduced to some principles of text analysis, and then walked through an example of how these principles can be applied using ArchiText, the program described above. Part II introduces some quantitative methods which yield readily interpretable graphic displays. The differences between analytic and Exploratory Data Analysis are presented. The use of Dot Plots, Box Plots, Frequency Analysis, Scatter Plots, and Contingency Tables is introduced, with interpretive examples. Part III is an in-depth discussion of the use and interpretation of Contingency Tables, and introduces the concept of Block Tagging and how it is applied to analytic activity.


Do-It-Yourself Data Mining – Part I
Text Analysis Using ArchiText

Principles of Text Analysis

Before considering the existing program, and the mechanics necessary for updating it, it is important that we share a common understanding of the principles of text analysis that this author considered in developing the original design of the program. To that end, it will be useful to review some of the slides from a PowerPoint presentation designed to clarify those principles.

Concordancing

The earliest analysis of text began with a process going back to the development of printing. Called Concordancing, it consists of locating every word within a text and counting the frequency of its appearance within the text. This process was


first applied to scholarly analyses of the Bible. The slide below illustrates a screen shot of TextStat, a free concordancing program available on the Net.

Getting the frequency of occurrence of specific words, by itself, has considerable utility. Here is some of the information which can be derived just from this data:

• Subject Information: Inspection of the proper and common nouns, sorted by frequency, quickly provides an overview of the subject of this text collection.

• Structural Analysis: Teachers, linguists, and others having an interest in the structure of language use can apply various ratios and percentages to list contents. Examples include the percentage of prepositions to the total word count, active to passive voice verbs, etc.
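TextStat and similar tools perform this counting for you. Purely as an illustration, here is a minimal Python sketch of the same frequency count, assuming the corpus sits in a plain-text file (the file name is hypothetical):

```python
from collections import Counter
import re

def word_frequencies(path):
    """Count how often each word appears in a plain-text corpus."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"[a-z']+", text)   # deliberately crude tokenizer
    return Counter(words)

freqs = word_frequencies("corpus.txt")     # hypothetical file name
for word, count in freqs.most_common(20):
    print(f"{count:6d}  {word}")
```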

Still, the mere presence of a word, or even its frequency of occurrence, provides relatively limited information regarding its usage within the corpus (the totality of all documents under analysis). The next step, therefore, is to view a word or words of interest within a context. You will often find the acronym KWIC (Key Words In Context) referring to this process.
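The KWIC display itself is simple to approximate. A minimal sketch, again with a hypothetical file name, that prints each hit of a keyword centered in a fixed window of context:

```python
import re

def kwic(text, keyword, width=40):
    """Key Words In Context: print each hit centered in a window of text."""
    text = " ".join(text.split())            # collapse newlines for display
    for m in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        print(f"{left}[{m.group()}]{right}")

kwic(open("corpus.txt", encoding="utf-8").read(), "terrorist")  # hypothetical
```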


While not shown here, proximity searching is made available in the Query Editor, initiated with the button shown. Expanding to a full citation provides a full view of just how and where each word appears in the context of the total document. This is quickly accomplished by clicking the Citation button.


The Complexity of Word Tags

Using a number of methods, individual words can be linked to each other as they occur throughout the corpus of material under study. This is especially important when the document contains a large number of names of people, locations, or events, which can easily become confusing. Which person is linked to which employer, location, or event? What about individuals sharing the same surname? Separating words into categories and clear identities brings clarity to confusion. The next five slides, with text extracted from the 9/11 Commission Report, illustrate this clarification process. We initially begin with a paragraph of unedited text:


The first step in increasing the information value of individual words is to select a category, in this case surnames of individuals. Using the Replace dialog in Word, change all instances of a surname to ALL CAPS.


Any compound word which you wish to have listed completely must have the space character replaced with a hyphen.


Prefixes serve to group words which are members of the same category together so that they will appear in a group within a word listing.


Suffixing the root word provides differentiation between identical names. Thus, in the example below, the two “AL-SHEHRIs” are identified as siblings, but can be separated with respect to individual activity.


Here are some rules for compound words, other than the name of people:


All of the techniques shown above enhance your capabilities for analysis of a document or collection of documents, but none are essential to the full employment of the program.
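Nothing in these conventions is specific to Word's Replace dialog. As a rough sketch of how the same tagging pass might be scripted, with every name and rule below purely illustrative:

```python
import re

# Illustrative rules following the conventions above: surnames to ALL CAPS
# with a category prefix, compound words hyphenated, suffixes added to
# distinguish individuals who share a surname.
RULES = [
    (r"\bbin Laden\b",        "p-BIN-LADEN"),        # person prefix "p-"
    (r"\bNew York\b",         "New-York"),           # compound word hyphenated
    (r"\bWail al Shehri\b",   "p-AL-SHEHRI-WAIL"),   # suffix separates siblings
    (r"\bWaleed al Shehri\b", "p-AL-SHEHRI-WALEED"),
]

def tag_text(text):
    """Apply each find/replace rule in turn, as Word's Replace dialog would."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text
```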

ArchiText, a Text Analysis (Text Mining) Program

Import and Split – Creating Nodes

The first step in using ArchiText is to import the corpus of one or more documents into the program. In the example shown below, the text of the 9/11 Commission Report is going to be subjected to analysis. In general, if the document is of any significant length, you'll want to split it into categories or sections which represent some logical division of the whole. After creating a new ArchiText file, select the file you're going to use.


You can elect to import the entire file, or to split the file into elements, referred to as “Nodes.” If you choose the former, you will have just one node, having the document name.

If you decide to split into sections, you can use any symbolic character (or combination of symbolic characters) as the target string by which sections are split, as shown above. In this example, we have split the entire document into chapters, as illustrated in the Node Directory shown below. Double clicking on any of the Nodes will open a window in which the text of the node appears.
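For readers who want to prototype the idea outside ArchiText, a minimal sketch of this splitting step, assuming each section begins with a target string such as "## ":

```python
def split_into_nodes(text, target="## "):
    """Split a document into named nodes wherever the target string starts a line."""
    nodes, name, lines = {}, "PREAMBLE", []
    for line in text.splitlines():
        if line.startswith(target):
            nodes[name] = "\n".join(lines)          # close out the previous node
            name, lines = line[len(target):].strip(), []
        else:
            lines.append(line)
    nodes[name] = "\n".join(lines)                  # final node
    return nodes
```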


Node Selection

Regardless of the analysis you are going to perform, you can select all nodes, an ordered set of nodes, or a discontinuous combination of nodes. Preferences allow you to order nodes alphabetically or by time modified.


Keyword Lists

This is typically the first analysis you will do, and you will usually have done a good deal of tag preparation in the original file, as described above. In the selection above, our interest is in identifying the key terrorist players, so the keyword search was restricted to those nodes where they are discussed. After selecting the nodes whose words are to be listed, the keyword dialog sets up the parameters for the listing.

Most of these choices are self-explanatory, but the “stop word list” requires some discussion. ArchiText comes pre-loaded with a modifiable list of words (articles, prepositions, and auxiliary verbs) which ordinarily are irrelevant to content. Thus, when this item is checked, these words are eliminated from the frequency

listings. However, there are times when these words have usefulness for a given analysis, and they can then be included by un-checking this box.
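A sketch of how such a stop word list operates, layered on the frequency counter shown earlier (the list contents here are only a tiny illustrative subset):

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "was"}  # illustrative

def filtered_frequencies(freqs, use_stop_list=True):
    """Drop stop words from a frequency listing unless the box is un-checked."""
    if not use_stop_list:
        return dict(freqs)
    return {w, c for w, c in ()} if False else {w: c for w, c in freqs.items()
                                                if w not in STOP_WORDS}
```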

In this partial view of the resulting frequency list, each person has been prefixed by “p-”, which facilitates grouping all of those named into the category “Person.” For those occurring with high frequency, we will proceed to extract all information regarding them, and combine that information into a single new node focused on each of them.


Extract and Combine to Make New Nodes

In our first search, two of the terrorists have been selected. Selecting the “S” tab will automatically initiate the search dialog. Remember that the nodes were preselected when the keyword list was constructed.

After pressing the “Start Search” button, you will see the following results in the Directory.


Notice that the number of occurrences of the names of the two terrorists within each node is highlighted. The next step is to extract just this information and combine it into a new node. To do this, select “Combine Nodes” from the “Analysis” menu. In the example shown below, we have searched for George Bush, and are extracting all occurrences of his name throughout the nodes.

Select “Embed Node Name” if you want the source nodes named in the new node. After completion, a new node is created containing only those paragraphs in which Bush is named. The result of this combination looks like this:
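ArchiText shows the combined node in its own window. As a rough scripted approximation of the same extract-and-combine step, reusing the nodes dictionary from the earlier splitting sketch:

```python
def combine_nodes(nodes, term, embed_node_name=True):
    """Extract every paragraph mentioning `term` across nodes and
    combine the hits into one new node (a single string)."""
    hits = []
    for name, text in nodes.items():
        for para in text.split("\n\n"):
            if term.lower() in para.lower():
                hits.append(f"[{name}]\n{para}" if embed_node_name else para)
    return "\n\n".join(hits)

# e.g., nodes["BUSH-extract"] = combine_nodes(nodes, "BUSH")
```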


This illustration is, of course, only a small portion of the nodes in which the Bush name occurs. If you wish, you can “drill down” further, building a keyword list for this node alone and searching for other combinations related to Bush as they occur within paragraphs or sentences in which his name appears. If desired, you could build extracted nodes for any combination of Bush and other words included in your search.

Identify Relationships – Node Maps

In a 500-page document there are obviously a huge number of relationships between people, events, locations, and other categories. Node Maps facilitate your finding and manipulating these relationships in an infinite number of ways.


New maps are built in the same way as new nodes – by using the create button for maps in the directory dialog. You will note that there are a number of nodes which are not on the map, but which are available through selection and pressing the "Add to Map" button. When nodes are deleted from the map, they appear in the left column, which is the "On-Call" list. Nodes can also be added from the On-Call list through a search which selects some of the nodes in that list.


One way of visualizing the nodes found in a search is to change the size of the nodes selected by that search. This option is available by selecting, "Change Node Size," in the Map menu found on the main toolbar.


A far more powerful option is available. Using any of the eight linking tools which are available, a "Parent Node" can be connected, or linked, to each of the nodes to which a relationship exists. One example of this linking is shown in the map below. In this case, Terrorist 001 (Osama bin Laden) is linked to each of the chapters in which his name appears.


As you see below, every node which is linked to another is illustrated in the nodes window. Double clicking on any node name opens the node window and, depending on the preference settings, will either open both the source node and destination node, or open the destination node while closing the source node.


Implications for Data Mining

The methodology employed here facilitates the discovery of all kinds of relationships between people, events, and locations – in fact, of any word or phrase to any other. Typically, as relationships are discovered, new sub-nodes will be created so that those relationships can be examined and further linked to other relationships. The specialized tagging explained in this tutorial is not strictly necessary; it simply makes it easier to define categories of items, making their location and identification easier within ArchiText and providing a basis for the quantitative analyses which can flow from these categorical classifications.


Some Limitations

While the design of this program offers features which this author has found in no other program, it was designed in 1988, and there are limitations and deficiencies which demand starting from the beginning and rebuilding the program shell. Listed below are some of the current problems which must be resolved for the program to reach its full potential for users:

• By far the most serious deficiency is that the program operates only on older Macintosh computers still running OS 9.x or earlier. Its search and linking functions are available in no other program except highly expensive, enterprise-level data mining systems. The program therefore needs to be updated so that it is usable on any platform.

• As currently constituted, the program can import text only in ASCII format, and lacks the capability to open or read Internet files.

• There are a number of deficiencies in the search algorithm, most particularly the program's inability to process numerical searches. Thus, a search for a number greater than, equal to, or less than another quantity cannot currently be accomplished.

• As displayed above, the mapping capabilities of the program are very limited, and a number of modifications should be made so that more effective pictographic displays are readily available. An example of one such possibility is shown below.


What’s Next?

Part II of this series discusses the application of some of the more traditional methods of Data Mining, describing the ease with which standard statistical methods may be used to determine and present complex relationships existing between words and numbers, without the need for advanced expertise or expensive, complex professional analytic software.


Do-It-Yourself Data Mining – Part II
Concepts and Display

Introduction

In beginning our consideration of Data Mining, readers will find many, if not all, of the concepts involved to be foreign to their past experience. “Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge Discovery, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc. Data mining is a complex topic and has links with multiple core fields such as computer science, and adds value to rich seminal computational techniques from statistics, information retrieval, machine learning, and pattern recognition.”

For most of us, our major approach to processing information ultimately depends on the viewpoint of others. This is because our entire education has largely consisted of memorizing facts, and solving problems based on the rules that others have given to us. We typically find that we rely on "facts" and "conclusions" which come from those whose viewpoint is most similar to the one we have developed over a long period of time. Here is a graphic example which illustrates this process.

One reason that many avoid engaging in DM is the perception that it requires training not only in statistics, but in database usage as well. Typically, those employed as data analysts will have formal training and experience in database programming languages, statistical programming, and research design. This paper seeks to provide those who have competence in general computer applications with the intellectual tools necessary to shortcut the heavy-duty software and training used by the professional.

It All Depends on Point of View

Assume that you hold a “Liberal” viewpoint with respect to the war. You will tend to reject the statements of conservatives – in essence “screening out” the Red view of any dispute. You will tend to accept, in fact even to receive, only that information which is in agreement with these long-held views.


Conversely, if the views you hold are consistent with those held by conservatives, you will tend to see and incorporate into your thinking the views held by members of that group.

None of us is immune to the tendency to accept or reject new information to the degree that it is consonant with our previously constructed world views. The analysis tools discussed here can serve to free you from these frameworks and provide new ways of looking at the information contained within the text.

What You Will Be Learning

Looking at the names given to the disciplines and knowledge used by those engaged in Data Mining, you are likely to think that such training is far beyond your own education. This article is designed to show you that, while many of those who do DM have advanced academic backgrounds, the core principles can be learned and employed by everyone – in fact, they can be used by students in their early teens. For those lacking a strong background in statistical analysis and the tools of quantitative analysis such as Regression, Correlation, and Cross Tabulation (Contingency Tables), this material will initially appear very new. Most who have little or no experience using and calculating statistics tend to think of statistics as a discipline which uses numeric values to reach conclusions or results. While this is very much the case, there are also a number of statistical methods which use text exclusively, or as a part of the calculations. As you proceed through this tutorial, you will find the concepts introduced much easier to understand than you had previously believed.


From Numbers and Words to Conclusions – An Example

Getting Acquainted with Statistical Ideas

Before starting, we need to define some terms used by statisticians when they carry out research.

Populations and Samples

Two terms you will run across throughout this document are Sample and Population. The 100 8th graders whose responses we are going to obtain are a Sample of all boys and girls in the 8th grade going to school in the United States. This total is referred to as the 8th Grade Population. Whatever we do with the sample, the greatest concern is that the results we find for the sample are very similar to those which would be found if the entire Population could be measured.

Variables

Information regarding either Populations or Samples is broadly divided into two kinds of Variables – Discrete (Categorical) and Numeric. The height of each member (case) in the sample, as well as the grade point average, are the numeric variables which will be used in this example. Categorical (Text) variables are labels assigned to place each case in one of two or more different Categories. “A,” “B,” “C,” “D,” and “F” are all elements of the Categorical Variable named “Grade.” These categories are either located and extracted from the text of the documents being analyzed, or derived from equations, as shown below. In this instance, a formula was used to derive a letter grade from the grade point average for each case.

Figure 1 – Deriving Letter Grades from Performance Average
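The figure is not reproduced here, but the derivation it illustrates can be sketched as a simple mapping, assuming the ten-point grade ranges described later in this article (scores of 60 and below failing, 91–100 an "A"):

```python
def letter_grade(average):
    """Map a numeric performance average onto the categorical variable Grade."""
    if average >= 91: return "A"
    if average >= 81: return "B"
    if average >= 71: return "C"
    if average >= 61: return "D"
    return "F"                     # scores of 60 and below are failing

grades = [letter_grade(avg) for avg in (95, 72, 58)]   # -> ['A', 'C', 'F']
```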

All of the categorical variables in this example come from mathematical derivations, since there is no real textual data from which they can be extracted. Nonetheless, you will find yourself dealing with this same construction of categories, especially when the text source is a series of tables.


First Steps – Defining the Analysis

More than anything else you do, this is the most important element of conducting your analysis. Your efforts are done for some purpose, and this first step is where you define the questions for which you seek answers. Here is an example of the kind of question you may seek to answer through your analysis: There is a huge amount of material about whether global warming is taking place, and if it is, what people can or should do about it. Other scientists say that the world’s temperature has nothing to do with what man does or does not do. How can I determine whom to believe? Or, here is another: The President tells us that the way we need to go in Iraq is to add troops, but the majority of the American public says we should get out. So too do many of our generals, the majority of Congress, and many experts. Who is right?

In order to illustrate the use of simple categories, an example has been developed from a survey sampling answers from students attending a middle school located in a small Texas town. This hypothetical analysis generated a number of questions which illustrate how one might use labels to evaluate what would otherwise be considered to require numeric data. Here are some of the questions this analysis will answer:

1. Given the choice, during leisure time, of “Playing Sports” or “Doing something else,” which choice will 8th graders make?
2. Is there a difference between genders with respect to which is more likely to make either choice?
3. Is there a difference in academic performance between boys and girls?
4. Is there a difference in average height between those who select “Playing Sports” and those who select “Doing something else”?
5. Is there a relationship between height and the choice of use of leisure time?
6. Is there a relationship in grades between those who select “Playing Sports” and those who select “Doing something else”?

If this sample had actually been collected, it would have been done by asking each sampled student the following questions:

1. Are you a boy or a girl? ________
2. What is the most recent report card grade you received in this class? ____


3. How tall are you (in inches)? ____
4. If you have a choice of activity on the weekend, which would you prefer to do? ___ Play Sports ___ Do something else [Put an “X” on the line next to your choice.]

From a combined word count of a little over 500 words (100 sets of answers to the 4 questions), you will find that a wealth of information can be obtained. You will begin to reach some powerful answers to your questions, and know the likelihood that your answers are correct.

Composition of the Sample: Boys vs. Girls

The first thing we want to know is whether there are equal numbers of boys and girls in our sample. As you can see, both are equally represented in the sample.

Figure 2 – Gender Composition of Sample

Overall Academic Performance

What about the academic performance of all in the sample? Since the letter grades are derived from numeric averages of performance throughout the school year, we use a Bar Chart to inspect the number of students receiving each of the five grades.

Figure 3 – Grade Frequencies


The horizontal axis contains the ranges of grades, in this case with the lowest being between 51 and 60, proceeding in increments of 10 points, and ending at a perfect score of 100. The vertical axis gives you the number of cases within each range. We can see that performance is pretty much as we might expect – a few in the failing and top ranges, more in the high and low ranges, and most right in the middle, where a “C” is the grade awarded. But that is far from the whole story. Looking at this graph, you will certainly want to know whether there is any difference between the academic performance of the boys vs. the girls. The first step is to inspect the frequency of each category, Boys and Girls, with respect to the entire sample. To do this, we use a plot called a “Dot Plot”; with boxes added, the plot is referred to as a “Box Plot.”

Figure 4 – Dot and Box Plots of Performance Averages
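A sketch of how such a display can be produced with matplotlib, assuming the performance averages have already been collected into two lists:

```python
import matplotlib.pyplot as plt

def dot_and_box_plot(boy_scores, girl_scores):
    """Side-by-side box plots with the individual scores dotted over them."""
    fig, ax = plt.subplots()
    ax.boxplot([boy_scores, girl_scores], labels=["Boys", "Girls"])
    for x, scores in enumerate([boy_scores, girl_scores], start=1):
        ax.plot([x] * len(scores), scores, "k.", alpha=0.5)  # the dot plot
    ax.set_ylabel("Performance average")
    plt.show()
```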

As you can see, the girls did better than the boys, both in the median of their scores (the white line in the center of the shaded areas) and in the top grade

received by a girl vs. that of a boy. While none of the girls failed (scored below 61), two of the boys did. Removing the boxes, we see the three boys who are “outliers” – those who represent extreme low values disconnected from the rest of their group – as well as the three high-scoring girls who are also disconnected from the rest of their group by their high performance. Yet, while we have a good look at where the average scores fall, and the distribution of scores in each group, we really don’t have an accurate count of how many in each group received which letter grade. To get this precision, we instead turn to a statistical tool referred to as a “Contingency Table.” As you look at the table, it becomes evident that, at every grade level, the girls did better than the boys. Something else is also evident: if you look at the p-value shown at the bottom of the table, you will note that it is shown as p = .0478. This tells you that there are slightly less than 5 chances in 100 that you would find the boys equaling or surpassing the performance of the girls if you repeated this survey with other boys and girls in the 8th grade.

Figure 5 – Boy vs. Girl Grades Received

Finding Relationships

To this point, we have been working largely with word frequencies to interpret the information we have displayed. Now we turn to some numerical methods for determining relationships between the underlying variables which have led to classifying some of the variables into words.

Height and Academic Performance

The 8th grade is a time of great change in physical growth for adolescents. Thirteen-year-old girls tend to be well into puberty, while many boys lag in development, causing far greater variation in boys’ heights than is found with the girls.

Figure 6 – Comparison of Heights for Boys and Girls

Nonetheless, boys at this age are, on average, approximately two inches taller than the girls.

Figure 8 – Height vs. Grades – Full Sample

This Scatter Plot of height vs. grade averages for the entire sample is very interesting. There is a modest trend toward those receiving higher grades being shorter than those receiving lower grades. Boys are shown as Xs, and girls as small open circles. The red line is the cutoff separating low and high grades. Is this trend the same or different for boys and girls? To answer this question, we split the total by gender, as shown in the figure below.

Figure 9 – Height vs. Grades by Gender


While we see this trend persisting for the boys in the left plot, there appears to be almost no relationship for the girls, as shown by the nearly level line in the girls’ plot at right. One of the nice things about this kind of display is that it does not require the viewer to interpret complicated numeric calculations – instead, simply looking at the plot makes a number of things evident:

• There are only 7 girls, as opposed to 18 boys, who received grades below “C” (< 71) in this sample.
• Looking at the boys’ heights, the shortest boys tended to get the best grades, while the tallest boys were more evenly distributed in the grades they received. Thus, one might posit that short boys may be more motivated toward academic work than tall ones, since they have fewer distractions in their attraction to the girls, and less likelihood of being engaged in time-consuming sports activity.
• Conversely, there is almost no relationship between the heights of girls and the grades they receive. While they may be involved in extra-curricular activities, most parents will severely limit any dating activity in this age group.

Choice of Leisure Activity and Grades

Recall that there was another two-choice question in our example. It asked, “If you have a choice of activity on the weekend, which would you prefer to do? ___ Play Sports ___ Do something else.” Recall that in our example, these students live in a small town in Texas. In such towns, high school football assumes a high degree of importance. This cultural bias toward sports participation leads us to the following hypotheses:

• While the students in our sample are too young to participate in a high school program, boys will have strong aspirations and interest in future participation, leading to leisure-time sports activity.


• Since high school football is a male-only sport, girls will show less interest in sports participation, although some, of course, will participate in programs that are equally open to boys and girls.
• Regardless of gender, students with larger body mass will have a greater inclination toward sports participation than their smaller counterparts.
• For a variety of reasons, students who either participate in, or are emotionally invested in, organized athletic activity will tend toward lower grade achievement than those not involved.

We begin this analysis with a contingency table showing the relationship between academic performance and sports participation:

Figure 10 – Activity Choice vs. Grade Performance

When the choice of “Play Sports” vs. “Something Else” is overlaid on grades and height, we see a clear relationship between those receiving poor grades and those choosing to spend their leisure time playing sports rather than doing “Something Else.” You will also note that this behavior is much more pronounced among the boys (11) as compared with the same choices made by girls (3).


Figure 11 – Overlay of Activity Choice vs. Grades and Height

A Note about the “Findings”

In reviewing all of the above, it is important to note that these findings are based upon a set of results constructed by the author. The survey questions described were never actually given to any group of students, and the results are completely fictitious. There are, in fact, a number of studies which take an opposing view, finding that student athletes tend to be among the high performers within their educational settings. One peer-reviewed study, involving many more variables than the sample provided here, is available here.

Statistical Programs

For anyone wanting to do serious analysis using the methods described above, a statistical analysis program is required. Many readers will be reluctant either to spend the money or to engage in the steep learning curve required to master many professional-level programs. The author uses a professional version of DataDesk, a uniquely powerful yet easy-to-use program. A relatively low-cost ($75.00) Excel add-in for the same program is offered by the publisher, making available all of the analyses described above.


Do-It-Yourself Data Mining – Part III
Using Block Tags to Analyze Text

Introduction

In Part II of this series, you learned how text and numeric data can be used to extract meaningful and useful information from large collections of textual material. In this section, we look at how textual content can be reorganized so that it can be extracted as demonstrated in the previous article. As before, your most important task is to define in advance the information you expect to derive from your efforts. As previously, this will be stated in the form of questions to be answered or hypotheses to be tested. There are two kinds of tagging to be considered, each form serving a different purpose.

Word Tags

Word Tags are words contained within the document which have particular importance, either because of the frequency with which they occur or because of their association with other words within a sentence or paragraph. By capitalizing (George BUSH), compounding (New-York), and adding prefixes (DOD-RUMSFELD) or suffixes (BUSH-POTUS43) to words identified as important, word elements can be combined and linked to others having some element in common. A full discussion of Word Tags was provided in Part I of this series.

Block Tags

Block Tags are words added by the user to categorize sections of text (by sentence or paragraph) such that Contingency Tables can be constructed showing the dependence or independence of one category on another. Before you scratch your head trying to figure out what I am saying, here is an illustration of the process I am describing: Each year the President presents his State of the Union speech to Congress and the watching nation. In 2005, a popular President Bush, newly reelected, came to Congress with a sweeping domestic agenda which he anticipated bringing into law during his second term in office.


By the same time in 2006, with the war in Iraq going badly, much of the nation’s attention was directed at resolving the war, and away from the issues raised in 2005. Here are some of the questions which flow from this brief description: Was there a significant difference in the emphasis the President gave to subjects which appeared in both speeches, when compared across the two years in question? Since the speeches are used as a means of presenting an agenda and convincing the audience of the value of the President’s views, what elements comprised the form in which his views were presented? We will closely inspect these and a number of other questions which Block Tags can answer.

Interpreting Contingency Tables

Before going on to the construction of Block Tags, we need to spend some time looking at the power of contingency tables to provide you with precise knowledge regarding the information you are seeking. There is a great deal available from these seemingly simple numeric tables. In preparing this example, the text of both State of the Union speeches was divided into sentences, and then several categories were assigned to each sentence:

• YEAR
• DOMESTIC/FOREIGN AFFAIRS
• RATIONAL/EMOTIONAL (a “Selector” variable)

Here is what they look like in generating the sample contingency tables:


All of the tables you have viewed previously were simple 2 x 2 tables. This refers simply to the number of columns and rows within each table. In the example below, the two speeches are in columns, with each sentence classified as referring to either FOREIGN AFFAIRS or DOMESTIC matters.

The question answered by this table is: Was there a difference between the two speeches with respect to a change in emphasis regarding FOREIGN AFFAIRS or DOMESTIC matters? Since the number of sentences in the ’06 speech increases by 27 over that of ’05 (287 − 260), and the number of FOREIGN AFFAIRS sentences likewise increases by 27 (116 − 89), it may at first appear that this does not really represent a huge change. To get a preliminary idea of the magnitude of the change, let us first substitute the percentage of column totals to see whether this difference looks important.

We see a small increase (approximately 6.2%) in the proportion of sentences devoted to FOREIGN AFFAIRS, but we still don’t know whether this increase is important enough to say there has been a real change in emphasis, or whether this was just a random occurrence… perhaps there were different speech writers, or some other factor entered into the process of preparing the speech. To make a further determination of whether this apparently small difference is really important, we need to look at another set of calculations in the table – the Expected Values. Without going into detail on how these values are calculated, they should be interpreted in the following way: the greater the difference between actual (counted) cell values and Expected Values, the greater the chance that the row values are affected by the column values.
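For reference, the formula behind each Expected Value is the standard one:

\[
E_{ij} = \frac{R_i \times C_j}{N}
\]

where \(R_i\) is the row total for cell (i, j), \(C_j\) its column total, and \(N\) the grand total. For the FOREIGN AFFAIRS/’05 cell this gives E = 205 × 260 / 547 ≈ 97.4, a little over 8 above the observed count of 89.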


All of the expected values are above or below their respective counts by a little over 8 points. Thus we turn to the next measure of differences – the Standardized Residuals. This number (the difference between the observed and expected count, scaled by the square root of the expected count) gives you information about the magnitude of the variation of the actual counted values from the expected values.

Looking at the highlighted number, we see that during the ’06 speech, FOREIGN AFFAIRS showed the greatest increase in emphasis. The question that remains to be answered is: How likely is it that the increase in FOREIGN AFFAIRS statements was related to the time that it occurred (2006), or was it merely due to chance?

To answer this question, we turn to two other statistics: The Chi-Square value, and the Probability (p = ) of occurrence:
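All of these statistics can be reproduced with a few lines of SciPy, using the sentence counts quoted above (89 and 116 FOREIGN AFFAIRS sentences, 171 DOMESTIC sentences in each year). A sketch that recovers the expected values, standardized residuals, Chi-Square, and p-value in one call:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[ 89, 116],    # FOREIGN AFFAIRS, '05 / '06
                     [171, 171]])   # DOMESTIC,        '05 / '06

chi2, p, df, expected = chi2_contingency(observed, correction=False)
residuals = (observed - expected) / np.sqrt(expected)  # standardized residuals

print(f"Chi-Square = {chi2:.2f}, df = {df}, p = {p:.4f}")  # p comes out near 0.135
print(residuals.round(2))  # largest positive cell: FOREIGN AFFAIRS / '06
```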

The Null Hypothesis

You would expect that the theory or hypothesis you are stating, related to the data above, would read something like this: The increase in FOREIGN AFFAIRS statements is related to the time at which it occurred (2006). Because of the methods by which statisticians calculate the likelihood or probability of a hypothesis being true, hypotheses are stated in just the reverse form from what one would ordinarily expect. Thus: There is no statistically significant difference between FOREIGN AFFAIRS statements made in 2005 and 2006. This form of stating a hypothesis is referred to as the Null Hypothesis. When you see a “p-value” for such a statement, it is giving you the probability that this negative form of your hypothesis is true, or correct. Researchers will Reject the Null Hypothesis when the value of “p” is less than or equal to some value previously determined by them. Thus, if we reject the Null Hypothesis, we accept as likely the original form of the hypothesis. Let’s see how this operates with the contingency table we have been working with:


The number “p = 0.1355” is translated as: The probability that there is no statistically significant difference between FOREIGN AFFAIRS statements made in 2005 and 2006 is .1355. (This means that there is a 13.5% chance that there is no difference between the two variables.) For most researchers this number is far too high. Most researchers require a p-value less than .05 (5%) or .01 (1%), and so we would accept (more precisely, fail to reject) the Null Hypothesis, since .1355 is too large to allow us to reject the statement. Without going into a technical explanation of the Chi-Square value, the general rule is that the higher this number, the lower the p-value will be. The “df” refers to “Degrees of Freedom”; it is derived from the number of cells in the table. As df increases, a larger Chi-Square value is required to obtain the same p-value. Neither of these statistics is necessary for your interpretation of these tables. Admittedly, this “double-negative” approach is counterintuitive to almost everyone who has not been involved with statistical research. However, any reading of professional journals in such disciplines as medicine, psychology, political science, and many others will show hypotheses presented as described above.

Selector Variables

A selector variable is used to filter all of the counts within a contingency table, reducing the totals in each cell to only those cases which meet the criteria for the


selector. In the example below, we add the EMOTIONAL Selector to the contingency table we have been using.
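A sketch of the selector idea in pandas, assuming one row per coded sentence with columns named YEAR, SUBJECT, and TYPE (the column names are illustrative): the crosstab is simply rebuilt from only the rows matching the selector.

```python
import pandas as pd

def selected_crosstab(df, selector_col="TYPE", selector_val="EMOTIONAL"):
    """Filter to rows matching the selector, then rebuild the contingency table."""
    subset = df[df[selector_col] == selector_val]
    return pd.crosstab(subset["SUBJECT"], subset["YEAR"])
```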

Inspecting the two tables reveals some important differences between them:

• Comparing the total counts between the two tables, EMOTIONAL statements account for 27.2% of all statements made by the President in both years. EMOTIONAL sentences occurred 3.5 times more frequently in ’06 than in ’05.
• DOMESTIC sentences decreased from 342 (all statements) to 97 (EMOTIONAL only), with the ratio between the two years shifting from exactly 1:1 overall to EMOTIONAL statements dominating ’06 by a ratio of 3.6:1.
• While total statements regarding FOREIGN AFFAIRS increased by 30% between the two years, those having an EMOTIONAL rating increased by 81%.
• Most importantly, the probability of the Null Hypothesis being TRUE has dropped from .135 to .065.
• Taking all of the data together, the interpretation which follows concludes that the President significantly increased his use of emotional appeals between the ’05 and ’06 speeches.

Contingency Table Summary

While much of the above may have seemed tedious and difficult to retain, the more you derive from contingency table analysis of categorical variables, the more powerful and useful the conclusions reached from this tag assignment system will be.


Preparing Block Tags

Selecting Variables

As previously mentioned, the number of variables and the variable headings should be determined in advance. In the example which follows, there are three variables, with headings YEAR, TYPE, and SUBJECT.

Category Selection

The general rule in determining the number of categories you are going to use is to give each variable the smallest number of categories that still allows for the information you seek to discover. Ideally, you will have a 2 x 2 table to work with: differences between expected and actual values will be at their greatest, giving the opportunity for the highest Chi-Square values and the lowest p-values obtainable. Of course, in real situations this is often not possible. In fact, until you are actually in the process of coding variables, you will not know the names of all the categories.

Text Categories Used in the Example

YEAR
• 05 Speech
• 06 Speech

TYPE
• EMOTIONAL = A conclusion based upon emotion, not facts. (“Americans are a compassionate people,” or the “Axis of Evil.”)
• FACTUAL = A verifiable statement. (“There were 12 marines killed in the helicopter crash.”)
• F-CONCLUS = An inference or conclusion based upon factual data. (“Another 20,000 troops are required to end the violence in Baghdad.”)
• PROMISE = A commitment made by the speaker. (“I will sign a bill which…”)
• REQUEST = A request made by the speaker. (“I ask that Congress continue the tax reduction…”)
• REQUIRE = A demand made. (“Congress must pass legislation to…”)
• RHETORICAL = A statement which communicates no information.

SUBJECT
• CONGRESS = Statements regarding Congress
• ENERGY = Oil, nuclear, alternative energy sources
• INTERNATIONAL = All foreign affairs matters, except war
• LAW = Law enforcement and the courts
• MONEY = All things related to the economy, employment, Social Security, taxes, budget, health insurance
• SECURITY = Homeland Security issues
• WAR
• YOUTH

Coding the Document

Preparing the Source Text

Much of the text that you will be organizing and manipulating will come from web pages, PDF files, or numerical tables. All raw data should be converted to Word documents, so that the tools contained within Word can be utilized. In our example we will be working with the State of the Union addresses delivered in 2005 and 2006. To assemble the two completed documents, you will need to copy and paste from the text, changing pages and eliminating extraneous material. Since we are going to be working with two different speeches, you should code one year at a time, and then combine the two finished documents into one composite data set. In a moment, you will see why. Shown below is a fragment of the 2005 address, showing the first four paragraphs.


1. Use Global Replace to parse all paragraphs into their component sentences.

The result appears as shown below.


2. The following Global Replace dialog readies the text formatting of the two speeches for Block Tag entry.

3. This results in the following text, with “2005” later replaced with “SOU-05.” A separate Replace dialog is used for the 2006 speech.


4. Using the Table menu, the text is converted to the following table:

5. Empty cells can be rapidly categorized using the appropriate tag as each sentence is evaluated. Remember that there are two speeches to be coded.


While the above appears to involve a substantial amount of work in preparing the tag tables, Steps 1 through 4 take only a few minutes. After completion of Step 5, simply copy the categorical columns to an Excel sheet, and then import the variables to the statistics application you have chosen. You are now ready to assess the results of the relationships existing among the variables.
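For those who would rather script these steps than run them through Word, here is a rough sketch approximating Steps 1 through 4; the sentence splitter is deliberately crude and the file names are hypothetical:

```python
import csv
import re

def sentences(text):
    """Crudely parse paragraphs into their component sentences (Step 1)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

rows = []
for year, path in [("SOU-05", "sou2005.txt"), ("SOU-06", "sou2006.txt")]:
    with open(path, encoding="utf-8") as f:
        for sent in sentences(f.read()):
            rows.append([year, "", "", sent])  # TYPE and SUBJECT filled by hand (Step 5)

with open("block_tags.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["YEAR", "TYPE", "SUBJECT", "SENTENCE"], *rows])
```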

Analysis from Two Perspectives

Analytic Statistics

Dating from the 19th century, classical analytic statistics are familiar to anyone who either conducts or evaluates scientific research. This method provides a means for testing the reliability and validity of predefined hypotheses. Thus, a pharmaceutical manufacturer believes that Drug “A” will be effective (or more effective, safer, etc.) than other drugs in treating disease “X.” The investigators apply one or more “research designs” to conduct experiments designed to test their hypotheses. Important to note here is that the investigators have already determined the expected results, and are now simply seeking to determine whether their expectations are correct.

Exploratory Data Analysis

There are many phenomena for which we really don’t have any clearly defined explanation for what we observe. This is particularly true in the social and behavioral sciences. What factors, for example, had the most impact on the 2004 presidential election? What changes in American society can be expected if current immigration legislation becomes law? Typically, many variables will be involved in the explanation, some of which will have unknown effects, or not even be suspected of having any relationship to the Dependent (y, or “predicted”) variable.

Comparing the Two State of the Union Speeches

Fresh from his reelection, the President delivered a State of the Union speech in January 2005 that reflected his plans for using his political capital to advance his legislative priorities in the year to follow. By the time the 2006 speech was delivered, there had been a marked change in the challenges faced by the President. No longer was support for the Iraq war by any means as great as it had been in the previous year. Protests were well established, as was rising criticism of Bush policy, both foreign and domestic. Thus, we would expect to find significant differences in content emphasis and in tone. Moreover, since 2006 was to be a mid-term election year, the congressional, journalistic, and public audiences were all mindful of the effect this speech might have on that election.


Keep these considerations in mind as we explore the differences and commonalities in the two speeches.

Emotionality – Rationality

An initial area of interest is the degree to which the style of the two speeches stays consistent or changes. The ’05 speech reflected the President’s confidence in his ability to bring the Iraq war, if not to an end, at least toward movement in his promised direction. He advanced a number of favorite domestic priorities which he was confident would be accepted and achieved. By 2006, much had changed. Domestically, his plan for Social Security privatization, which he had so glowingly announced, was dead in the water. While an elected government was developing on schedule in Iraq, little progress, and in fact regression, was evident with respect to sectarian and insurgent violence. American casualties continued to spike, and the public was showing serious resistance to the confident assertions of the Administration. Given these circumstances, a reasonable inference is that the EMOTIONAL TONE (defined here as the ratio of emotional to rational statements) will be significantly greater for the 2006 speech than was the case in 2005. Using all of the TYPE classifications, a comparison between the frequencies of EMO-CONCLUS (Emotional Conclusions) and all other categories makes it very evident that the increase in this category was the largest and most significant of all changes between the two speeches. This is further reinforced by the decline in FACTUAL statements between the two speeches.


This table displays all categories of statement types. If we restrict the categories to “EMOTIONAL,” “RATIONAL” (F-CONCLUS + FACTUAL), and “OTHER,” the major shift to emotionalism in the 2006 speech becomes evident.


The following figures point to the major shift from RATIONAL to EMOTIONAL statements:

• EMOTIONAL statements changed from 15.8% of all of the ’05 statements to 37.6% of all ’06 statements.
• Of all EMOTIONAL statements made in the combined speeches, 72.5% were made in the ’06 speech.

Drilling Down for Deeper Insight

When we inspect the YEAR, SUBJECT, and TYPE categories, there is a great deal we can discover just by trying various combinations in contingency tables. The combination of the two tables below, comparing the proportions of EMOTIONAL statements between those having a subject of DOMESTIC matters and those concerning FOREIGN AFFAIRS, reveals what may have been an unexpected trend.


Here are some of the inferences supported by a side-by-side inspection of the two tables:

• There was a significant shift between 2005 and 2006, with the emotional component being substantially higher in 2006 than was the case in the earlier year.
• That the trend was unexpected is demonstrated by the fact that DOMESTIC statements having an emotional component outnumbered those concerning FOREIGN AFFAIRS.

Sliding Granularity

Inspecting the above, we see a shift toward emotionality in the tone of the ’06 speech. While the p value of .066 is approaching an acceptable level of significance (.05), we don’t yet know which of the subject categories accounts for this change. Thus, we want to get to sufficient detail to tell us which category or categories account for this shift to emotionality. The category FOREIGN AFFAIRS is quite easy to split into its respective elements, since it is composed of only two – WAR and INTERNATIONAL. WAR refers to military actions being taken in either Iraq or Afghanistan; INTERNATIONAL refers to all other references to relations with foreign countries. Selecting the appropriate categories composing the DOMESTIC category is more complex, since it is composed of seven elements (MONEY, VALUES, SECURITY, LAW, YOUTH, ENERGY, and CONGRESS). To use all of them would

defeat the purpose of determining which among them was most affected by this shift. As shown in the table below, only two (MONEY and VALUES) exceed 10% of the combined statements for both years.

Thus we assemble contingency tables for each of the 4 variables, looking at the p values of each. First, the Main Table, showing the shift toward emotional statements. Note the p value, which clearly establishes the shift toward EMOTIONAL in the ’06 speech.

Next we compare the change for ’06 between DOMESTIC and FOREIGN AFFAIRS.


Interpreting the above becomes a bit complicated, but if you follow the process, it should become clear how to correctly read the results:

• A shift in tone toward the EMOTIONAL was determined in the comparison of ’05/’06.
• Therefore, in the table above, we search for the EMOTIONAL cell which contains the largest positive residual. This cell, as highlighted, is the EMOTIONAL/DOMESTIC cell.
• Since the p value of .0245 < .05 (our threshold for significance), we are assured that this change meets our criterion for statistical significance.

Having determined that DOMESTIC is the category of interest, we direct our attention to the two components of which it chiefly consists, MONEY and VALUES. We want to limit our analysis to determine which of these two has the greater significance. This will be determined by their respective p values.


Comparing the two tables, it is immediately evident that MONEY is the element of DOMESTIC policy which saw the greatest increase in EMOTIONAL statements. We have three ways of confirming the relative weight of the two elements:

• The total number of MONEY statements (148) exceeded the VALUES statements (68) by a full 80 sentences. Clearly, MONEY statements received more attention than did VALUES.
• The MONEY/EMOTIONAL cell has a positive residual 2.3 times greater than that of VALUES, suggesting that the pull toward EMOTIONAL statements was far greater for MONEY than for VALUES.
• Finally, the Chi-Square value for MONEY (21.54) was far greater than that of VALUES (11.54), with the corresponding p value for MONEY lower than that for VALUES (p ≤ .0001 vs. p = .0031).

Summarizing the MONEY – VALUES Findings

• The first question of interest was whether there was a difference between the ’05 and ’06 speeches. A trend in the direction of more emotional statements being made in the ’06 speech was identified.
• Two general categories of statement had been previously identified – DOMESTIC and FOREIGN AFFAIRS. The largest positive shift in emotional statements was found in the DOMESTIC category.
• Finally, the DOMESTIC category held two significant elements, VALUES and MONEY, with MONEY accounting for the greater share of the shift.
