Whitepaper

November 2019

Ideogram Based Sentiment Analysis in Japanese Text
Tyler Thornblade

Introduction Many papers apply similar techniques across

differing languages Two papers in this class introduced a novel technique: assign sentiment at the character (sub-word) level Opinion Extraction, Summarization and

Tracking in News and Blog Corpora, Ku, L., Liang, Y. & Chen, H. AAAI 2006 Symposium. (John) Experiments with sentimental dictionary based classifier and CRF model, Huang, R., Sun, L. & Pan, L. Sixth NTCIR Workshop, 2007. (Cem)

Why are ideograms different?
- Unlike phonetic characters, ideograms have innate meanings
- Are they sentiment bearing?
- Example: 気
  - Spirit
  - Mind
  - Air
  - Mood
- Note that ideograms seldom have just one meaning; it is more typical to have a synset or a group of related synsets

Ku et al. Create a sentiment dictionary General Inquirer, Chinese Network Sentiment

Dictionary Expanded dictionary using thesauri  Tong2yi4ci2ci2lin2 (Mei et al. 1982)  Academia Sinica Bilingual Ontological Wordnet

(Huang et al. 2008)
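
The thesaurus expansion step can be illustrated with a minimal sketch. It assumes the thesaurus is available as a list of synonym sets and that a label is propagated only when the labelled members of a set agree; how Ku et al. actually handled conflicting sets is not stated here. The function name `expand_lexicon` and the toy entries are hypothetical.

```python
def expand_lexicon(seed, synsets):
    """Propagate seed polarity labels (+1 / -1) to unlabelled synonyms.

    seed:    dict mapping word -> +1 or -1 (hypothetical seed dictionary)
    synsets: iterable of sets of words treated as synonyms (assumed format)
    """
    expanded = dict(seed)
    for synset in synsets:
        # Collect the polarities of the members we already know.
        labels = {seed[word] for word in synset if word in seed}
        # Assumption: skip sets with no labelled member or with conflicting labels.
        if len(labels) != 1:
            continue
        label = next(iter(labels))
        for word in synset:
            expanded.setdefault(word, label)
    return expanded


# Toy data, invented for illustration only.
seed = {"好": 1, "坏": -1}
synsets = [{"好", "良", "佳"}, {"坏", "恶"}, {"好", "坏", "中"}]
expanded = expand_lexicon(seed, synsets)
```

In this toy run the third set contains both a positive and a negative seed word, so its unlabelled member is left out rather than guessed.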

Performance of Ku et al. & Huang et al.
- Results were fair but not impressive
- Neither paper outlined results at the word level

Hypothesis These techniques will not be as effective in

Japanese as in Chinese Why? Bag-of-words type approach ignores

compositional understanding Japanese uses script in addition to ideograms

A Short Background on Japanese
- Although linguistically unrelated, Chinese and Japanese both use Chinese characters extensively
- Many multi-character compounds in Japanese are borrowings
- Writing systems
  - Chinese characters (Kanji)
  - Script (Hiragana, Katakana)
    - Words that mix characters with script (okurigana)
    - Words that are entirely script (kana)
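
The three kinds of words named above (all characters, mixed, all script) can be told apart from Unicode code points alone. The sketch below is an illustration, not part of the original study; the function names are hypothetical, and it only checks the basic CJK Unified Ideographs, Extension A, Hiragana and Katakana blocks, so marks such as 々 fall into "other".

```python
def script_of(ch):
    """Return a coarse script label for a single character."""
    code = ord(ch)
    # CJK Unified Ideographs and Extension A
    if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def classify_word(word):
    """Classify a word as 'kanji', 'kana', 'mixed' or 'other'."""
    scripts = {script_of(ch) for ch in word}
    has_kanji = "kanji" in scripts
    has_kana = bool(scripts & {"hiragana", "katakana"})
    if has_kanji and has_kana:
        return "mixed"   # characters plus script, e.g. okurigana
    if has_kanji:
        return "kanji"   # compound made only of Chinese characters
    if has_kana:
        return "kana"    # entirely script
    return "other"
```

A character-level scorer has nothing to work with for words classified as "kana", and only partial information for "mixed" words.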

Japanese Compound Composition
- Five classes
  1. Both characters have the same meaning.
  2. The characters have opposite meanings.
  3. The top character modifies the bottom character.
  4. The bottom character is the target, direct object, or complement of the top character.
  5. The top character negates (“flips”) the meaning of the bottom character.
- The first two are fine; the last three could present problems for Ku et al.
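
To see why class 5 is a problem for character averaging, consider the following sketch. It is an illustration only: the character scores are invented, the list of negating characters is an assumption, and the negation-aware variant is one possible compositional rule, not something implemented in this study.

```python
# Assumed set of negating first characters (illustrative, not exhaustive).
NEGATING_PREFIXES = {"不", "無", "非", "未"}


def average_score(word, char_scores):
    """Bag-of-characters score: mean of the known character scores."""
    known = [char_scores[ch] for ch in word if ch in char_scores]
    return sum(known) / len(known) if known else 0.0


def negation_aware_score(word, char_scores):
    """Flip the sign of the remainder when the first character negates it."""
    if len(word) > 1 and word[0] in NEGATING_PREFIXES:
        return -average_score(word[1:], char_scores)
    return average_score(word, char_scores)


# Invented scores for illustration only.
char_scores = {"不": -0.6, "便": 0.8, "利": 0.7}

plain = average_score("不便", char_scores)          # 不便: "inconvenient"
aware = negation_aware_score("不便", char_scores)
```

With these toy scores, plain averaging lets the strongly positive 便 outweigh 不 and rates 不便 as mildly positive, while the negation-aware rule rates it negative. A compound without a negating first character, such as 便利, gets the same score from both functions.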

Experiment Start with sentiment dictionary of Kaji and

Kitsuregawa (2007) (Presented by Tyler), => 10,000 words Clean to remove bigrams, trigrams => 2386 words Apply Ku et al. Generate sentiment scores for the 954

Chinese characters Generate sentiment scores for the words in the dictionary Ignore magnitude and score result by comparing sign of Kaji & Kitsuregawa to sign of program output

Caveats This is a proof of concept; there was

insufficient time (and resources) to develop a new sentiment dictionary and/or perform an annotation study Train and Test on same data Results not comparable to other systems Should interpret as an upper bound on

performance of this method  We start with essentially perfect knowledge of the

sentiment value of words  Our results should be near optimal for this method

Results

Oh no! Weren’t we expecting poor results?

Detailed results for characters

Error Analysis 20% of the errors were selected for detailed

analysis 50 false positives 50 false negatives These were further pruned so that only multi-

character compounds were considered

Error Analysis, False Positives
- 33% of errors explained by lack of compositional knowledge
  - 6.7% class 5
  - 27% class 3

Error Analysis, False Negatives
- 54.8% of errors explained by lack of compositional knowledge
  - 3.2% class 5
  - 51.6% due to “的”

Other errors Script characters We can’t analyze words entirely made up of

script  34.7% of all errors were due to this

Words that mix script with characters may

introduce additional noise

Problems with source data
- After cleaning, the dictionary still contained 4-5% bigrams
- Some data from Kaji & Kitsuregawa is unintuitive
  - E.g. 無用 and 不用, both of which mean “useless” yet received high positive sentiment scores and showed

Evaluation of lexicon
- Pulled a list of 500 adjective phrases randomly selected from Web
- After removing parse errors and duplicates, 405 unique phrases
- No overlap with development set
- Balance: 158 positive, 150 negative, 97 neutral
  - Based on human annotation
  - Two annotators, Kappa 0.73
- Baseline: Turney 2002, co-occurrence in a window
  - Turney used “excellent” and “poor”, they use 最高 “best” and 最低 “worst”
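
The Turney-style baseline can be sketched as a log ratio of co-occurrence counts with the two seed words. This is a minimal illustration, assuming raw hit counts are already available; the function name, the parameter names and the small smoothing constant applied to the co-occurrence counts are assumptions, and the example counts are invented.

```python
import math


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """Turney-style orientation of a phrase from co-occurrence counts.

    near_pos / near_neg: times the phrase occurs in a window with the positive /
                         negative seed word (here 最高 and 最低)
    hits_pos / hits_neg: total occurrences of each seed word (must be > 0)
    smoothing:           small constant (assumed) so a zero count does not
                         cause a division by zero or log of zero
    """
    numerator = (near_pos + smoothing) * hits_neg
    denominator = (near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Invented counts for illustration only.
so_positive = semantic_orientation(near_pos=80, near_neg=20, hits_pos=1000, hits_neg=1000)
so_negative = semantic_orientation(near_pos=20, near_neg=80, hits_pos=1000, hits_neg=1000)
```

A phrase is labelled positive when the value is above zero and negative when it is below; swapping the two co-occurrence counts only flips the sign.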

Conclusions Overall: results were good. As a proof of

concept, this provides support for additional work in this area. Hypothesis was accurate in that approximately 60% of the errors were explainable in terms of missing linguistic knowledge

Next steps Perform a more rigorous study of this nature Use Kaji & Kitsuregawa dictionary and do an

annotation study to show the true performance of this approach Create a better sentiment dictionary and do the same  Kobayashi’s Evaldic might be one resource

Apply compositional features Unclear if lexical data of this nature is

available Apply word-based techniques to script

characters

Questions
