Building Lexicon for Sentiment Analysis from Massive Collection of HTML Documents
Kaji, N. & Kitsuregawa, M., 2007, EMNLP-CoNLL Conference
Presented by Tyler Thornblade

Introduction Creation of high-quality polar phrase lexicon Fully automatic approach Sacrifice recall for precision Make up for low recall by using an enormous

corpus

General Approach Start with a large

HTML corpus Find polar sentences Extract polar phrases from polar sentences Analyze polar phrases and add to lexicon

Sentence extraction: Syntactic Clues
  • Manually created list of cue phrases (single underline)
  • Automatic detection of polar sentences (double underline) based on syntactic constraints

Sentence extraction: Layout Structure
  • Heuristics for extracting polar data from itemized lists
  • Heuristics for extracting polar data from various kinds of tabular data

Evaluation of polar sentence corpus
  • 500 sentences selected; two annotators evaluated whether the sentences were polar or non-polar
  • Annotator A precision: 91.4%; Annotator B precision: 92.0%
  • Inter-annotator agreement: 93.5% (Kappa 0.90)
  • Most errors were due to lack of context, e.g. “There is much information” marked as positive

      Polar phrase extraction Extract phrase candidates from polar

      sentences using structural clues Count occurrences in positive and negative sentences Uses known cue phrase list (with modifiers for

      negations) Drop counts of phrases not in the main clause  E.g. “Although the price is high, the shape is

      beautiful”

      Drop counts of phrases that appear less than

      three times in total in both positive and negative sentences
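The counting-and-filtering step above can be sketched in a few lines of Python. The phrase strings and counts here are invented for illustration; the actual candidate extraction from parsed sentences is not shown:

```python
from collections import Counter

# Hypothetical counts of candidate phrases in positive and negative
# polar sentences (extraction of the candidates themselves not shown).
pos_counts = Counter({"price is low": 5, "shape is beautiful": 4, "screen flickers": 1})
neg_counts = Counter({"price is high": 6, "screen flickers": 1})

# Drop candidates seen fewer than three times in total across
# positive and negative sentences, as the slides describe.
candidates = {
    p for p in set(pos_counts) | set(neg_counts)
    if pos_counts[p] + neg_counts[p] >= 3
}
```

Here "screen flickers" appears only twice in total and is dropped, while the other three phrases survive into the frequency table.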

Polar phrase evaluation: Chi-square
  • First create a table of frequencies
  • Evaluate using Chi-square

Polar phrase evaluation: PMI
  • Reuse the table of frequencies
  • Evaluate using PMI

Polar phrase evaluation: thresholding
  • Finally, for both Chi-square and PMI, use a configurable threshold:
    PV > theta: positive phrase; PV < -theta: negative phrase
  • By adjusting theta, we can balance recall vs. precision
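The scoring and thresholding steps above can be sketched in Python. This is a simplified illustration, not the paper's exact formulation: the count-based definitions are assumptions, and the PMI term assumes the phrase occurs at least once in both positive and negative sentences (otherwise it would need smoothing):

```python
import math

def polarity_values(f_pos, f_neg, n_pos, n_neg):
    """Chi-square and PMI-style polarity values for one phrase.

    f_pos/f_neg: counts of this phrase in positive/negative sentences;
    n_pos/n_neg: total counts over all phrases (illustrative definitions).
    Assumes f_pos > 0 and f_neg > 0; real data would need smoothing.
    """
    f = f_pos + f_neg
    p_pos = n_pos / (n_pos + n_neg)
    p_neg = n_neg / (n_pos + n_neg)
    # Expected counts if the phrase were independent of polarity.
    e_pos, e_neg = f * p_pos, f * p_neg
    chi2 = (f_pos - e_pos) ** 2 / e_pos + (f_neg - e_neg) ** 2 / e_neg
    chi2_pv = chi2 if f_pos > e_pos else -chi2  # sign by the phrase's leaning
    # PMI difference: PMI(phrase, pos) - PMI(phrase, neg).
    pmi_pv = math.log2((f_pos / f) / p_pos) - math.log2((f_neg / f) / p_neg)
    return chi2_pv, pmi_pv

def classify(pv, theta):
    """Apply the configurable threshold from the slides."""
    if pv > theta:
        return "positive"
    if pv < -theta:
        return "negative"
    return "neutral"
```

For example, a phrase seen 8 times in positive and 2 times in negative sentences (with balanced totals) gets a positive chi-square value and a PMI polarity value of 2.0, so with theta = 1.0 it enters the lexicon as positive.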

      Evaluation of lexicon Pulled a list of 500 adjective phrases

      randomly selected from Web After removing parse errors and duplicates,

      405 unique phrases No overlap with development set Balance: 158 positive, 150 negative, 97 neutral  Based on human annotation  Two annotators, Kappa 0.73

Baseline
  • Turney 2002: co-occurrence in a window
  • Turney used “excellent” and “poor”; they use 最高 (“best”) and 最低 (“worst”)
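Turney's baseline scores a phrase by the difference between its PMI with the positive seed word and its PMI with the negative seed word, which reduces to a log-odds ratio of co-occurrence counts. A minimal sketch; the hit counts below are hypothetical (Turney estimated them from search-engine queries):

```python
import math

def so_pmi(hits_near_pos, hits_near_neg, hits_pos_seed, hits_neg_seed):
    """Turney-style semantic orientation from co-occurrence counts.

    SO(phrase) = PMI(phrase, positive seed) - PMI(phrase, negative seed),
    which simplifies to this log-odds ratio of hit counts. Argument names
    are illustrative; all counts must be nonzero.
    """
    return math.log2(
        (hits_near_pos * hits_neg_seed) / (hits_near_neg * hits_pos_seed)
    )
```

A phrase co-occurring 80 times with the positive seed and 20 times with the negative seed (seeds equally frequent overall) gets SO = 2.0; a phrase leaning the other way gets a negative score.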

Evaluation of lexicon (results charts)

Direct analysis of lexicon
  • Human analysis of 200 items from the lexicon
  • Two annotators; average precision 71.3%, Kappa 0.66

      Error analysis Turney method had trouble with neutral

      sentences (37 out of 48 errors) Good performance on colloquial phrases (e.g. dasai) not commonly found in dictionary/thesaurus Lexicon captured a lot of non-adjectival data of interest It is hard to receive the effect グラフィックが綺麗だ The graphics are pretty 手入れが楽だ It is easy to maintain 影響を受け難い

      Subjective responses Sentence level analysis Only analyzed 0.1% of sentence corpus “Most” errors due to context, so by a looser

      standard precision may be significantly higher than 92%

      Phrase level analysis Rigorous How hard did they try with Turney? Bad

      results. Picked an easy target (adjectival phrases) Human analysis only analyzed 2% of lexicon

      Closing thoughts Best ideas Very high precision, low recall, and crunch a

      lot of data Page layout cues for extracting data Domain independent Applicable to other langauges (See Cem’s Kanayama et al discussion) Method captured a lot of nouns and verbs, even though they didn’t evaluate this aspect

Related works
  Contrasting works
  • Words with Attitude, Kamps, J. & Marx, M. (Danielle, LEX) [Thesaurus]
  Seed word based approaches
  • Automatic Seed Word Selection for Unsupervised Sentiment Classification of Chinese Text, Zagibalov, T. & Carroll, J. (Tyler, Clues & Class)
  • Identifying and Analyzing Judgment Opinions, Kim, S. & Hovy, E. (Matt, OES)
  • Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, Turney, P. (Michael, Docclass)
  Supervised approach to polar phrase identification
  • Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis, Wilson, T., Wiebe, J. & Hoffmann, P. (Michael, Clues & Class)
  • Extracting Aspect-Evaluation, Kobayashi, N., Inui, K. &

Related works (continued)
  Syntactic patterns
  • Detection of Users’ Wants and Needs from Opinions, Kanayama, H. & Nasukawa, T. (Cem, Apps)
  • Deeper Sentiment Analysis Using Machine Translation Technology, Kanayama, H., Nasukawa, T. & Watanabe, H. (Shilpa, Multi). Takes advantage of syntactic patterns but uses a very different (MT) approach.
  • See also Kanayama and Nasukawa 2006 for enhancements to Turney’s window-based approach to co-occurrence detection.

  Layout patterns
  • Mining and Summarizing Customer Reviews, Hu, M. & Liu, B., 2004 (not read, in list)
  • Automatic Identification of Pro and Con Reasons in Online Reviews, Kim, S. & Hovy, E., 2006 (Yaw, OD)
