Analysis of Research Literature with Text Mining
By Dursun Delen, Ph.D.
Spears School of Business
Oklahoma State University

Copyright © 2007, SAS Institute Inc. All rights reserved.

Outline
  • Data Mining
  • Text Mining
      − A Process Approach
  • Case Study: Text Mining for Literature Review
  • Results (tabular and graphical)
  • Last Comments


What Is Data Mining?
  • Data mining (knowledge discovery in databases):
      − Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information from data
  • Alternative names…
      − Data mining: a misnomer?
      − Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.


Data Mining Motivation: “Necessity is the Mother of Invention”
  • Data explosion problem
      − Why do we have more data now than we had before?
          − Technology-driven reasons: automated data collection tools…
          − Software-driven reasons: mature database technology…
          − Cost-driven reasons
  • We are drowning in data, but starving for knowledge!
  • Solution: Data & text mining


Why Text Mining?
  • We have much more unstructured data
      − “85 to 90 percent of all corporate data is stored in some sort of unstructured form (i.e., as text)” - Merrill Lynch and Gartner


Text Mining
  • What is text mining
  • Strengths
  • Weaknesses / Deficiencies
  • On what kind of data
  • Terminology of text mining
  • Text mining PROCESS
  • Application


Definition of Text Mining
  • Text mining is a process that employs
      − a set of algorithms for converting unstructured text into structured data objects (Statistical Natural Language Processing), and
      − the quantitative methods that analyze these data objects to discover knowledge (Data Mining)
  • Text Mining = Statistical NLP + DM


Text Mining Strengths
  • Clustering documents in a corpus
      − Corpus: a large collection of writings on a specific subject
  • Investigating time-variant word distribution across documents within a corpus
  • Identifying words with the highest discriminatory power
  • Classifying documents into predefined categories (e.g., good vs. bad comments)
  • Integrating text data with structured data to enrich predictive modeling endeavors


Text Mining Deficiencies
  • Text mining algorithms perform poorly in distinguishing negations, for example:
      − Herman was involved in a motor vehicle accident.
      − Herman was NOT involved in a motor vehicle accident.
  • Text mining algorithms do not work well with large documents.
      − Performance is slow.
      − Increased term occurrence across documents decreases separation of documents.
      − They work best with large numbers of shorter documents.
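The negation problem can be made concrete with a small sketch. It assumes a plain bag-of-words representation and cosine similarity, which the slides do not name explicitly; the function names are illustrative only. Under that assumption the two Herman sentences differ by a single count and look almost identical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count each word, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


s1 = "Herman was involved in a motor vehicle accident."
s2 = "Herman was NOT involved in a motor vehicle accident."
similarity = cosine_similarity(bag_of_words(s1), bag_of_words(s2))
print(round(similarity, 3))  # prints 0.943
```

If "not" were also on the stop word list, the two sentences would become indistinguishable in this representation.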


Text Mining Data
  • Textual data with one document per record (one file)
      − Tool-specific spreadsheet
  • Documents stored as individual files in a well-defined file system (multiple files, one document per file)
      − MS Word files
      − PDF files
      − TEXT or XML files


Text Mining Terminology
  • Corpus / Corpora
  • Word vs. Term
  • Stemming (model, models, modeling)
  • Stop words (a, an, the, I, he, …)
  • Synonyms, homonyms, acronyms
  • Term-by-document matrix
  • Word frequency
  • Singular value decomposition
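Several of these terms fit together in a few lines of code. The sketch below is illustrative only: the stop list is a toy list (the entry "is" is an added assumption), the suffix-stripping stemmer is a crude stand-in for a real stemming algorithm, and all function names are hypothetical.

```python
import re
from collections import Counter

# Toy stop list based on the slide's examples; "is" is an added assumption.
STOP_WORDS = {"a", "an", "the", "i", "he", "is"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text):
    """Turn raw text into terms: lowercase, drop stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def term_by_document(docs):
    """Return (vocabulary, matrix) with one row per term, one column per document."""
    counts = [Counter(to_terms(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocabulary]
    return vocabulary, matrix


docs = ["The model is modeling models", "He models the user"]
vocabulary, matrix = term_by_document(docs)
print(vocabulary)  # prints ['model', 'user']
print(matrix)      # prints [[3, 1], [0, 1]]
```

Here "model", "models", and "modeling" collapse into the single term "model", which is the distinction the slide draws between a word and a term.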

Text Mining Applications
  • Descriptive modeling: clustering / association rule mining
      − Clustering of similar documents
      − Identification of terms that go together
      − WWW search engines
  • Predictive modeling: classification
      − Classification of documents into predefined classes
      − Email classification
      − Genome classification
  • …


IDEFØ Function Modeling Method
  • An Activity or a Function (verb-phrase), node A0
      − INPUTS (noun-phrase)
      − CONTROLS (noun-phrase)
      − OUTPUT (noun-phrase)
      − MECHANISMS (noun-phrase)
      − CALL to other DIAGRAMS


Text Mining – A Process View - 1
  • Activity A0: Extract knowledge from available data sources
  • Inputs
      − Unstructured data (text)
      − Structured data (databases)
  • Controls
      − Software/hardware limitations
      − Privacy issues
      − Linguistic limitations
  • Output
      − Context-specific knowledge
  • Mechanisms
      − Domain expertise
      − Tools and techniques


Text Mining – A Process View - 2
  • Activities
      − A1: Collect related documents
      − A2: Process documents
      − A3: Extract desired knowledge
  • Inputs
      − I1: Unstructured data (text)
      − I2: Structured data (databases)
  • Controls
      − C1: Software/hardware limitations
      − C2: Privacy issues
      − C3: Linguistic limitations
  • Intermediate results
      − Organized document list (from A1 to A2)
      − Structured (numericized) documents (from A2 to A3)
  • Output
      − O1: Context-specific knowledge
  • Mechanisms (M1, M2)
      − Domain expertise
      − Tools and techniques


Text Mining – A Process View - 3
  • Activities
      − A21: Build stop (or include) word list
      − A22: Numericize documents
      − A23: Reduce word-by-document matrix
  • Input
      − I1: Organized document list
  • Controls
      − C1: Software/hardware limitations
      − C2: Linguistic limitations
  • Intermediate results
      − Domain-specific stop (or include) word list
      − Word-by-document matrix
  • Output
      − O1: Structured (numericized) documents
  • Mechanisms
      − M1: Tools and techniques
      − M2: Domain expertise
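The slide does not say how the word-by-document matrix is reduced. Singular value decomposition is listed on the terminology slide, so the sketch below assumes it and approximates only the leading singular triplet by power iteration, using the standard library alone. The function name and the toy matrix are hypothetical, and the matrix is assumed to be non-negative and non-zero.

```python
import math


def leading_singular_triplet(matrix, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration.

    `matrix` is a list of rows (terms) by columns (documents), assumed
    non-negative and not all zero.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(iterations):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u, then renormalize to get the next estimate of v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical 3-term by 3-document count matrix
term_by_doc = [[3, 1, 0], [0, 1, 2], [1, 0, 0]]
sigma, term_weights, doc_weights = leading_singular_triplet(term_by_doc)

# Each document is summarized by one coordinate instead of three term counts.
doc_coordinates = [sigma * x for x in doc_weights]
print(doc_coordinates)
```

Keeping more than one dimension would need further singular vectors, which a numerical library would normally supply; this sketch only shows the idea of replacing many term columns with a few derived ones.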


Text Mining Process Map
  • Problems / Opportunities
      − Environmental scanning
  • Unstructured Text
      − Documents (.doc, .pdf, .ps, etc.)
      − Fragments (sentences, paragraphs, speech encodings, etc.)
      − Web pages (textual sections, XML files, etc.)
      − Etc.
  • Document Structuring
      − Extract
      − Transform
      − Load
  • Structured Text
      − Tabular representation
      − Collection of XML files
      − Etc.
  • Term Generation
      − Natural Language Processing
      − Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.)
  • Term by Document Matrix
      − Rows: Doc 1, Doc 2, ..., Doc M
      − Columns: T_1, T_2, ..., T_N
      − Cells: term counts (example values on the slide: 1, 1, 2, 3, 1, 1)
  • Information / Knowledge
      − Clustering
      − Classification
      − Association
      − Link analysis
  • Feedback (shown twice on the map)


Analysis of Research Literature with Text Mining - Data Collection
  • Text mining of IS journals
      − ISR (Information Systems Research)
      − MISQ (MIS Quarterly)
      − JMIS (Journal of MIS)

Journal Name                          Number of Articles   Dates of Publication   Volumes/Numbers Included
MIS Quarterly (MISQ)                  246                  1994 – 2005            18/1 – 29/3
Information Systems Research (ISR)    253                  1994 – 2005            5/1 – 16/3
Journal of MIS (JMIS)                 402                  1994 – 2005            10/4 – 22/1
Total                                 901                  12 yrs


Analysis of Research Literature with Text Mining - Data Representation

Fields per record: Journal, Year, Author(s), Title, Vol/No, Pages, Keywords, Abstract

Record 1
  Journal:   MISQ
  Year:      2005
  Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
  Title:     Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
  Vol/No:    29/1
  Pages:     145-187
  Keywords:  knowledge management supply chain absorptive capacity interorganizational information systems configuration approaches
  Abstract:  The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Record 2
  Journal:   ISR
  Year:      1999
  Author(s): D. Robey and M. C. Boudreau
  Title:     Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
  Vol/No:    10/2
  Pages:     167-185
  Keywords:  organizational transformation impacts of technology organization theory research methodology intraorganizational power electronic communication mis implementation culture systems
  Abstract:  Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Record 3
  Journal:   JMIS
  Year:      2001
  Author(s): R. Aron and E. K. Clemons
  Title:     Achieving the optimal balance between investment in quality and investment in self-promotion for information products
  Vol/No:    18/2
  Pages:     65-88
  Keywords:  information products internet advertising product positioning signaling signaling games
  Abstract:  When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …






Analysis of Research Literature with Text Mining - Frequency Results
Information Systems Research (ISR)

1994-1996             1997-1999             2000-2002             2003-2005
Key words       Fr.   Key words       Fr.   Key words       Fr.   Key words       Fr.
model            29   model            23   model            38   model            28
process          26   process          21   process          29   process          16
user             18   empirical        18   empirical        23   decision         15
decision         16   framework        16   business         23   user             13
time             14   electronic       16   construct        20   organizational   12
structure        14   development      16   user             18   optimal          12
performance      13   structure        15   factor           17   environment      12
organizational   13   performance      15   framework        16   development      12
development      12   case             15   development      13   change           12
business         12   business         15   benefit          13   value            11
perspective      11   set              13   set              12   performance      11
construct        11   method           13   performance      12   time             10
change           11   organizational   12   organizational   12   empirical        10
method           10   industry         12   metric           12   cost             10
implementation   10   environment      12   individual       12   task              9
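A table of this shape can be produced by binning articles into the four three-year windows and counting terms within each bin. The sketch below is a minimal illustration with hypothetical records. The slides do not state whether "Fr." counts occurrences or articles; this sketch assumes occurrences, and the function names are illustrative.

```python
from collections import Counter, defaultdict

# The four windows used in the frequency tables
PERIODS = [(1994, 1996), (1997, 1999), (2000, 2002), (2003, 2005)]


def period_of(year):
    """Map a publication year to its three-year window label."""
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"year {year} is outside the study window")


def top_terms_by_period(records, n=15):
    """records: iterable of (year, list_of_terms). Returns {period: [(term, freq), ...]}."""
    counts = defaultdict(Counter)
    for year, terms in records:
        counts[period_of(year)].update(terms)  # counts every occurrence (an assumption)
    return {period: c.most_common(n) for period, c in counts.items()}


# Hypothetical records, not taken from the study's data
records = [
    (1994, ["model", "process", "model"]),
    (1995, ["model", "user"]),
    (2004, ["model", "value"]),
]
top = top_terms_by_period(records, n=2)
print(top["1994-1996"][0])  # prints ('model', 3)
```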


Analysis of Research Literature with Text Mining - Frequency Results
The Journal of MIS (JMIS)

1994-1996             1997-1999             2000-2002             2003-2005
Key words       Fr.   Key words       Fr.   Key words       Fr.   Key words       Fr.
model            34   process          40   model            47   model            52
process          33   model            39   process          45   process          33
business         27   organizational   28   management       33   value            24
development      23   development      27   market           26   business         23
user             22   decision         24   organizational   25   performance      22
management       21   support          24   empirical        23   factor           22
factor           21   management       23   performance      22   development      22
implementation   18   user             22   practice         21   quality          20
performance      17   business         22   development      20   empirical        19
change           17   case             20   time             19   product          18
structure        16   structure        18   factor           19   organizational   16
electronic       16   framework        18   business         19   framework        15
cost             16   communication    17   structure        18   user             14
benefit          16   change           17   cost             18   project          14
strategic        15   variable         16   perspective      17   tool             13


Analysis of Research Literature with Text Mining - Frequency Results
MIS Quarterly (MISQ)

1994-1996             1997-1999             2000-2002             2003-2005
Key words       Fr.   Key words       Fr.   Key words       Fr.   Key words       Fr.
development      23   model            24   model            26   model            34
process          19   organizational   19   organizational   22   organizational   24
model            19   case             18   management       17   management       23
business         19   process          17   process          14   process          22
change           18   management       17   manager          14   practice         18
manager          16   manager          16   factor           14   user             15
management       16   empirical        16   behavior         14   behavior         15
factor           16   user             14   software         12   performance      14
benefit          14   performance      14   development      12   individual       14
value            13   decision         14   construct        12   action           14
user             13   change           14   case             12   resource         13
practice         13   perspective      13   time             11   factor           13
case             13   managerial       13   performance      11   value            12
organizational   12   business         13   user             10   empirical        12
electronic       12   time             12   empirical        10   business         12


Analysis of Research Literature with Text Mining - Clustering Results

Cluster   Article Count   Share of Articles   Descriptive Terms (Top 10 Key words)
1          54             0.06                Error, Discipline, MIS, Major, Methodology, Field, Value, Time, Future, Set
2          89             0.10                Network, Policy, Cost, Market, Team, Industry, Resource, Manager, Product, Structure
3          87             0.10                Executive, Financial, Competitive, Industry, Advantage, Investment, Market, Management, Business, Performance
4          58             0.06                Consumer, Price, Product, Customer, Online, Service, Site, Quality, Market, Cost
5         121             0.13                User, Database, Instrument, Model, Validation, Field, Construct, Development, Computer, Individual
6          56             0.06                Strategic, Innovation, Nature, Survey, Success, Management, Investment, Practice, Business, Empirical
7          60             0.07                Medium, Computer-Mediated, GSS, Communication, Task, Face-To-Face, Laboratory, Interaction, Idea, Social
8         181             0.20                Process, Change, Organizational, Practice, Case, Management, Method, Tool, Framework, Base
9         195             0.22                Risk, Project, Construct, Software, Investment, Factor, Manager, Behavior, Development, Value
Total     901             1.00
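As a quick consistency check, the cluster counts can be summed and turned back into shares. The sketch below uses only figures that appear on the slides; the variable names are illustrative.

```python
# Article counts per cluster, copied from the clustering results
cluster_counts = {1: 54, 2: 89, 3: 87, 4: 58, 5: 121, 6: 56, 7: 60, 8: 181, 9: 195}

# Article counts per journal, copied from the data collection slide
journal_counts = {"MISQ": 246, "ISR": 253, "JMIS": 402}

total = sum(cluster_counts.values())
shares = {k: round(n / total, 2) for k, n in cluster_counts.items()}

print(total)                         # prints 901
print(sum(journal_counts.values()))  # prints 901
print(shares[9])                     # prints 0.22
```

Both totals agree at 901 articles, and the recomputed shares match the values listed next to the counts, so every article is assigned to exactly one cluster.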


Analysis of Research Literature with Text Mining - Clusters versus Journals

[Figure: nine bar-chart panels, one per cluster (CLUSTER: 1 to CLUSTER: 9). x-axis: JOURNAL (ISR, JMIS, MISQ); y-axis: No of Articles (0 to 100).]
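The counts behind panels like these come from a simple cross-tabulation of cluster assignments. The sketch below uses hypothetical assignments, not the study's data, and an illustrative function name. It tallies articles per cluster and journal, and the same tally by year gives a temporal profile.

```python
from collections import Counter

# Hypothetical (cluster, journal, year) assignments, for illustration only
assignments = [
    (9, "JMIS", 2004),
    (9, "JMIS", 2005),
    (9, "MISQ", 2004),
    (8, "ISR", 1996),
]


def profile(assignments):
    """Count articles per (cluster, journal) and per (cluster, year)."""
    rows = list(assignments)  # materialize so the data can be scanned twice
    per_journal = Counter((cluster, journal) for cluster, journal, _ in rows)
    per_year = Counter((cluster, year) for cluster, _, year in rows)
    return per_journal, per_year


per_journal, per_year = profile(assignments)
print(per_journal[(9, "JMIS")])  # prints 2
print(per_year[(9, 2004)])       # prints 2
```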


Analysis of Research Literature with Text Mining - Temporal Profiles of the Clusters

[Figure: nine panels, one per cluster (CLUSTER: 1 to CLUSTER: 9). x-axis: YEAR (1994 to 2005); y-axis: No of Articles (0 to 35).]

Comments / Lessons Learned…
  • Text mining is a very good tool for automating research literature review
      − At least as a first pass
  • It should be taken for what it is: not magic!
  • It can be applied to other literature-rich domains, such as medicine (esp. the genome project)
  • Need to incorporate more and more of NLP


Thank you for your attention!
  • Questions?
  • Comments!
  • Suggestions.

Reference: Delen, D. & M. Crossland (2007). “Seeding the survey and analysis of research literature with text mining,” to appear in Expert Systems with Applications, currently available on the digital database.
