Analysis of Research Literature with Text Mining
By Dursun Delen, Ph.D.
Spears School of Business, Oklahoma State University
Copyright © 2007, SAS Institute Inc. All rights reserved.
Outline
Data Mining
Text Mining
  • A Process Approach
Case Study: Text Mining for Literature Review
Results (tabular and graphical)
Last Comments
What Is Data Mining?
Data mining (knowledge discovery in databases):
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information from data
Alternative names…
  • Data mining: a misnomer?
  • Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Data Mining Motivation: “Necessity Is the Mother of Invention”
Data explosion problem
  • Why do we have more data now than we had before?
    − Technology-driven reasons: automated data collection tools…
    − Software-driven reasons: mature database technology…
    − Cost-driven reasons
We are drowning in data, but starving for knowledge!
Solution: data and text mining
Why Text Mining?
We have much more unstructured data
“85 to 90 percent of all corporate data is stored in some sort of unstructured form (i.e., as text)” - Merrill Lynch and Gartner
Text Mining
What is text mining
Strengths
Weaknesses / Deficiencies
On what kind of data
Terminology of text mining
Text mining PROCESS
Application
Definition of Text Mining
Text mining is a process that employs
  • a set of algorithms for converting unstructured text into structured data objects (Statistical Natural Language Processing), and
  • the quantitative methods that analyze these data objects to discover knowledge (Data Mining)
Text Mining = Statistical NLP + DM
Text Mining Strengths
Clustering documents in a corpus
  • Corpus: a large collection of writings on a specific subject
Investigating time-variant word distribution across documents within a corpus
Identifying words with the highest discriminatory power
Classifying documents into predefined categories (e.g., good vs. bad comments)
Integrating text data with structured data to enrich predictive modeling endeavors
Text Mining Deficiencies
Text mining algorithms perform poorly in distinguishing negations, for example:
  • Herman was involved in a motor vehicle accident.
  • Herman was NOT involved in a motor vehicle accident.
Text mining algorithms do not work well with large documents.
  • Performance is slow.
  • Increased term occurrence across documents decreases separation of documents.
  • They work best with large quantities of shorter documents.
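
The negation problem is easy to reproduce with a bag-of-words representation. The sketch below is only an illustration: the tokenizer and the stop word list are assumptions, not the behavior of any particular text mining tool. It shows that once a common word such as "not" is filtered out, the two example sentences become identical.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration; real tools ship their own.
STOP_WORDS = {"was", "in", "a", "not"}

def bag_of_words(sentence):
    """Lowercase, split into alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

positive = bag_of_words("Herman was involved in a motor vehicle accident.")
negative = bag_of_words("Herman was NOT involved in a motor vehicle accident.")

# With "not" treated as a stop word, the two sentences are indistinguishable.
print(positive == negative)
```

Keeping "not" in the vocabulary would separate the two sentences by a single term, but word counts alone would still not capture what is being negated.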
Text Mining Data
Textual data with one document per record (one file)
  • Tool-specific spreadsheet
Documents stored as individual files in a well-defined file system (multiple files, one document per file)
  • MS Word files
  • PDF files
  • TEXT or XML files
Text Mining Terminology
Corpus / Corpora
Word vs. Term
Stemming (model, models, modeling)
Stop words (a, an, the, I, he, …)
Synonyms, homonyms, acronyms
Term-by-document matrix
Word frequency
Singular value decomposition
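
To make the difference between words and terms concrete, the following sketch turns raw text into terms by removing stop words and stemming. It is a rough illustration under stated assumptions: `crude_stem` is a deliberately naive stand-in for a real stemmer, and the stop word list extends the slide's examples with two extra words ("and", "of").

```python
import re

# Illustrative stop word list: the slide's examples plus "and" and "of".
STOP_WORDS = {"a", "an", "the", "i", "he", "and", "of"}

def crude_stem(word):
    """Strip a trailing 'ing' or plural 's'. Naive on purpose; a real
    stemmer handles many more cases."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def to_terms(text):
    """Tokenize, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

terms = to_terms("He models the data; a model and the modeling process")
print(terms)  # ['model', 'data', 'model', 'model', 'process']
```

The three surface forms model, models, and modeling collapse into the single term "model", which is what lets word frequencies be compared across documents.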
Text Mining Applications
Descriptive modeling - Clustering / Association rule mining
  • Clustering of similar documents
  • Identification of terms that go together
  • WWW search engines
Predictive modeling - Classification
  • Classification of documents into predefined classes
  • Email classification
  • Genome classification
…
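
Clustering similar documents presupposes some measure of how alike two documents are. One common choice, used here as an assumption rather than as the method behind these slides, is the cosine similarity between term-count vectors. The document contents below are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical term counts for three short documents.
doc_a = Counter({"model": 2, "user": 1})
doc_b = Counter({"model": 1, "user": 1})
doc_c = Counter({"price": 2, "market": 1})

print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.949: shared vocabulary
print(cosine_similarity(doc_a, doc_c))            # 0.0: no terms in common
```

A clustering algorithm would group `doc_a` with `doc_b` and keep `doc_c` apart, because only the first pair shares terms.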
IDEFØ Function Modeling Method
[Diagram: a box labeled "An Activity or a Function (verb-phrase)", numbered A0, with four kinds of arrows attached]
  • INPUTS (noun-phrase)
  • CONTROLS (noun-phrase)
  • OUTPUT (noun-phrase)
  • MECHANISMS (noun-phrase), alongside a CALL to other DIAGRAMS
Text Mining – A Process View - 1
Activity A0: Extract knowledge from available data sources
  • Inputs: unstructured data (text); structured data (databases)
  • Controls: software/hardware limitations; privacy issues; linguistic limitations
  • Output: context-specific knowledge
  • Mechanisms: domain expertise; tools and techniques
Text Mining – A Process View - 2
[Diagram: A0 decomposed into three chained activities]
  • A1 Collect related documents: produces an organized document list
  • A2 Process documents: produces structured (numericized) documents
  • A3 Extract desired knowledge: produces context-specific knowledge (O1)
  • Inputs: unstructured data (text) (I1); structured data (databases) (I2)
  • Controls: software/hardware limitations (C1); privacy issues (C2); linguistic limitations (C3)
  • Mechanisms (M1, M2): domain expertise; tools and techniques
Text Mining – A Process View - 3
[Diagram: A2 (Process documents) decomposed into three chained activities]
  • A21 Build stop (or include) word list: produces a domain-specific stop (or include) word list
  • A22 Numericize documents: produces a word-by-document matrix
  • A23 Reduce word-by-document matrix: produces structured (numericized) documents (O1)
  • Input: organized document list (I1)
  • Controls: software/hardware limitations (C1); linguistic limitations (C2)
  • Mechanisms: tools and techniques (M1); domain expertise (M2)
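
The three activities A21 to A23 can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the sample documents, the stop word list, and in particular the reduction rule (keep only words that occur in at least two documents), since the slide does not say how the matrix is reduced.

```python
from collections import Counter

# Hypothetical, already tokenized documents (the "organized document list").
documents = {
    "doc1": ["the", "model", "explains", "user", "behavior"],
    "doc2": ["the", "user", "adopts", "the", "model"],
    "doc3": ["a", "process", "model", "for", "the", "firm"],
}

# A21: a domain-specific stop word list (illustrative).
stop_words = {"the", "a", "for"}

# A22: numericize each document into term counts, then lay the counts out
# as a word-by-document matrix: one row per word, one column per document.
counts = {name: Counter(t for t in tokens if t not in stop_words)
          for name, tokens in documents.items()}
vocabulary = sorted(set().union(*counts.values()))
matrix = {word: [counts[name][word] for name in documents] for word in vocabulary}

# A23: reduce the matrix. Assumed rule: keep words found in at least two
# documents, which drops rows that cannot relate documents to each other.
MIN_DOCS = 2
reduced = {word: row for word, row in matrix.items()
           if sum(1 for c in row if c > 0) >= MIN_DOCS}

for word, row in reduced.items():
    print(f"{word:<8}{row}")
```

In this toy corpus the seven-word vocabulary shrinks to two rows ("model" and "user"). Singular value decomposition is another reduction approach; it is not shown here.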
Text Mining Process Map
[Diagram: process stages connected by flow and feedback arrows]
Problems / Opportunities
  - Environmental scanning
Unstructured Text
  - Documents (.doc, .pdf, .ps, etc.)
  - Fragments (sentences, paragraphs, speech encodings, etc.)
  - Web pages (textual sections, XML files, etc.)
  - Etc.
Document Structuring
  - Extract
  - Transform
  - Load
Structured Text
  - Tabular representation
  - Collection of XML files
  - Etc.
Term Generation
  - Natural Language Processing
  - Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.)
Term by Document Matrix
  - Rows Doc 1, Doc 2, ..., Doc M; columns T_1, T_2, ..., T_N; each cell holds a term count
Information / Knowledge
  - Clustering
  - Classification
  - Association
  - Link analysis
Analysis of Research Literature with Text Mining - Data Collection
Text mining of IS journals
  • ISR (Information Systems Research)
  • MISQ (MIS Quarterly)
  • JMIS (Journal of MIS)

Journal Name                         Number of Articles   Dates of Publication   Volumes/Numbers Included
MIS Quarterly (MISQ)                 246                  1994 – 2005            18/1 – 29/3
Information Systems Research (ISR)   253                  1994 – 2005            5/1 – 16/3
Journal of MIS (JMIS)                402                  1994 – 2005            10/4 – 22/1
Total                                901                  12 yrs
Analysis of Research Literature with Text Mining - Data Representation
Each article is one record with the fields Journal, Year, Author(s), Title, Vol/No, Pages, Keywords, and Abstract. Sample records (abstracts are cut off on the slide):

Journal: MISQ
Year: 2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D. Robey and M. C. Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R. Aron and E. K. Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …

…
Analysis of Research Literature with Text Mining - Frequency Results
Information Systems Research (ISR)

1994-1996            1997-1999            2000-2002            2003-2005
Key word       Fr.   Key word       Fr.   Key word       Fr.   Key word       Fr.
model           29   model           23   model           38   model           28
process         26   process         21   process         29   process         16
user            18   empirical       18   empirical       23   decision        15
decision        16   framework       16   business        23   user            13
time            14   electronic      16   construct       20   organizational  12
structure       14   development     16   user            18   optimal         12
performance     13   structure       15   factor          17   environment     12
organizational  13   performance     15   framework       16   development     12
development     12   case            15   development     13   change          12
business        12   business        15   benefit         13   value           11
perspective     11   set             13   set             12   performance     11
construct       11   method          13   performance     12   time            10
change          11   organizational  12   organizational  12   empirical       10
method          10   industry        12   metric          12   cost            10
implementation  10   environment     12   individual      12   task             9
Analysis of Research Literature with Text Mining - Frequency Results
The Journal of MIS (JMIS)

1994-1996            1997-1999            2000-2002            2003-2005
Key word       Fr.   Key word       Fr.   Key word       Fr.   Key word       Fr.
model           34   process         40   model           47   model           52
process         33   model           39   process         45   process         33
business        27   organizational  28   management      33   value           24
development     23   development     27   market          26   business        23
user            22   decision        24   organizational  25   performance     22
management      21   support         24   empirical       23   factor          22
factor          21   management      23   performance     22   development     22
implementation  18   user            22   practice        21   quality         20
performance     17   business        22   development     20   empirical       19
change          17   case            20   time            19   product         18
structure       16   structure       18   factor          19   organizational  16
electronic      16   framework       18   business        19   framework       15
cost            16   communication   17   structure       18   user            14
benefit         16   change          17   cost            18   project         14
strategic       15   variable        16   perspective     17   tool            13
Analysis of Research Literature with Text Mining - Frequency Results
MIS Quarterly (MISQ)

1994-1996            1997-1999            2000-2002            2003-2005
Key word       Fr.   Key word       Fr.   Key word       Fr.   Key word       Fr.
development     23   model           24   model           26   model           34
process         19   organizational  19   organizational  22   organizational  24
model           19   case            18   management      17   management      23
business        19   process         17   process         14   process         22
change          18   management      17   manager         14   practice        18
manager         16   manager         16   factor          14   user            15
management      16   empirical       16   behavior        14   behavior        15
factor          16   user            14   software        12   performance     14
benefit         14   performance     14   development     12   individual      14
value           13   decision        14   construct       12   action          14
user            13   change          14   case            12   resource        13
practice        13   perspective     13   time            11   factor          13
case            13   managerial      13   performance     11   value           12
organizational  12   business        13   user            10   empirical       12
electronic      12   time            12   empirical       10   business        12
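
Rankings of this kind can be produced by grouping articles into three-year periods and counting terms within each period. The sketch below shows one way to do that; the article records are invented for illustration, and `period_of` and `top_terms` are hypothetical helper names, not part of the tool used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical (year, terms) records standing in for preprocessed abstracts.
articles = [
    (1994, ["model", "process", "user"]),
    (1995, ["model", "decision"]),
    (1996, ["process", "model"]),
    (2003, ["model", "value", "cost"]),
    (2005, ["value", "model", "task"]),
]

def period_of(year, start=1994, width=3):
    """Map a year to its period label, e.g. 1995 -> '1994-1996'."""
    first = start + ((year - start) // width) * width
    return f"{first}-{first + width - 1}"

def top_terms(records, k=2):
    """Return the k most frequent terms for each period."""
    per_period = defaultdict(Counter)
    for year, terms in records:
        per_period[period_of(year)].update(terms)
    return {period: counter.most_common(k) for period, counter in per_period.items()}

result = top_terms(articles)
for period, ranking in result.items():
    print(period, ranking)
```

With `k=15` and the real abstracts, the same grouping would yield one ranked column per period, as in the tables above.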
Analysis of Research Literature with Text Mining - Clustering Results

Cluster   Articles   Articles    Descriptive Terms (Top 10 Key words)
Number    (Count)    (Percent)
1         54         6           Error, Discipline, MIS, Major, Methodology, Field, Value, Time, Future, Set
2         89         10          Network, Policy, Cost, Market, Team, Industry, Resource, Manager, Product, Structure
3         87         10          Executive, Financial, Competitive, Industry, Advantage, Investment, Market, Management, Business, Performance
4         58         6           Consumer, Price, Product, Customer, Online, Service, Site, Quality, Market, Cost
5         121        13          User, Database, Instrument, Model, Validation, Field, Construct, Development, Computer, Individual
6         56         6           Strategic, Innovation, Nature, Survey, Success, Management, Investment, Practice, Business, Empirical
7         60         7           Medium, Computer-Mediated, GSS, Communication, Task, Face-To-Face, Laboratory, Interaction, Idea, Social
8         181        20          Process, Change, Organizational, Practice, Case, Management, Method, Tool, Framework, Base
9         195        22          Risk, Project, Construct, Software, Investment, Factor, Manager, Behavior, Development, Value
Total     901        100
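
The share column follows directly from the article counts. As a quick check, this short sketch recomputes each cluster's share as a rounded percentage of the 901 articles; the counts are copied from the table, and the variable names are arbitrary.

```python
# Article counts per cluster, copied from the table above.
cluster_counts = {1: 54, 2: 89, 3: 87, 4: 58, 5: 121, 6: 56, 7: 60, 8: 181, 9: 195}

total = sum(cluster_counts.values())

# Share of each cluster as a whole-number percentage of all articles.
shares = {cluster: round(100 * n / total) for cluster, n in cluster_counts.items()}

print(total)  # 901
for cluster, pct in shares.items():
    print(f"Cluster {cluster}: {cluster_counts[cluster]} articles, {pct}%")
```

Clusters 8 and 9 together account for roughly 42 percent of the corpus, while the four smallest clusters hold 6 to 7 percent each.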
Analysis of Research Literature with Text Mining - Clusters versus Journals
[Figure: nine bar-chart panels, one per cluster (CLUSTER: 1 through CLUSTER: 9). x-axis: JOURNAL (ISR, JMIS, MISQ); y-axis: No of Articles (0 to 100).]
Analysis of Research Literature with Text Mining - Temporal Profiles of the Clusters
[Figure: nine panels, one per cluster (CLUSTER: 1 through CLUSTER: 9). x-axis: YEAR (1994 to 2005); y-axis: No of Articles (0 to 35).]
Comments / Lessons Learned…
Text mining is a very good tool for automating research literature reviews
  • At least as a first pass
It should be taken for what it is: not magic!
It can be applied to other literature-rich domains, such as medicine (especially the genome project)
It needs to incorporate more and more NLP
Thank you for your attention! Questions? Comments! Suggestions.
Reference: Delen, D. & M. Crossland (2007). “Seeding the survey and analysis of research literature with text mining,” to appear in Expert Systems with Applications, currently available on the digital database.