Ideogram Based Sentiment Analysis in Japanese Text

Tyler Thornblade

1. INTRODUCTION

In recent years a growing body of literature has applied techniques for sentiment analysis to languages other than English. While many of these works apply similar techniques across differing languages, one technique notable for its specificity to ideographic languages has arisen: rather than considering sentiment only at the word, sentence, and document level, sentiment is also analyzed at the level of the constituent characters of a word. This technique may be of unique value to ideographic languages for two reasons. First, all multi-character words consist of characters that themselves have distinct meanings (or sets of meanings) (Henshall, 1988). Thus, information on individual characters can inform judgments made on composite words. Second, the concept of a word is not well defined in some ideographic languages, and many errors may be introduced by the “word segmentation” step in processing (Zagibalov & Carroll, 2008). Moving to a character-based approach can avoid the need for segmentation.

Two recent papers apply this technique to the Chinese language. In this work, we apply the same approach to Japanese. We theorize that techniques for assigning sentiment scores to individual ideograms in order to develop sentiment scores for words will not be as effective in Japanese as they are in Chinese without some significant modifications.

2. RELATED WORKS

2.1. Ku, Liang, and Chen

Ku, Liang and Chen (2006) introduce a character-based technique using a simple approach. They first construct a sentiment dictionary at the word level. In their paper, the base dictionary is constructed from the General Inquirer (Stone et al. 1966) and the Chinese Network Sentiment Dictionary corpora. They then expand the dictionary using two thesauri, tong2yi4ci2ci2lin2 (Mei et al. 1982) and the Academia Sinica Bilingual Ontological Wordnet (Huang et al. 2008). To develop character sentiment scores, they mine this dictionary, counting the number of positive and negative words in which each character appears. fp_ci and fn_ci denote the frequencies of a character c_i in the positive and negative words, respectively; n and m denote the total number of unique characters in positive and negative words, respectively (Ku et al. 2006).

The score for a character is:
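One form consistent with the definitions above, assumed here to follow Ku et al. (2006), normalizes each frequency by the total for its polarity before comparing the two. The symbol names P, N, and S are chosen for this sketch:

```latex
P_{c_i} = \frac{fp_{c_i} / \sum_{j=1}^{n} fp_{c_j}}
               {fp_{c_i} / \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}},
\qquad
N_{c_i} = \frac{fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}}
               {fp_{c_i} / \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} / \sum_{j=1}^{m} fn_{c_j}},
\qquad
S_{c_i} = P_{c_i} - N_{c_i}
```

Under this form every character score lies between -1 and 1.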

The score for a word is simply the average of the characters in the word:
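Writing S_{c_j} for the score of the j-th character of a word w that has p characters (the symbol names are assumptions of this sketch), the average described above is:

```latex
S_w = \frac{1}{p} \sum_{j=1}^{p} S_{c_j}
```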

Thus words with strong positive sentiment will have values much greater than zero, and negative words will likewise have values much less than zero. Words with weak sentiment scores will be found near zero. They use this technique to expand their small seed dictionary to be able to handle novel words that are composed of known characters.
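The dictionary-mining step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in either paper: it assumes the character score is the difference between the normalized positive and negative frequencies, that a character is counted once per word, and that characters absent from the seed dictionary are skipped when averaging. The function names and the toy word lists are hypothetical.

```python
from collections import Counter


def character_scores(positive_words, negative_words):
    """Return {character: score in [-1, 1]} mined from two word lists."""
    # Count the number of words of each polarity that contain each character.
    fp = Counter(ch for word in positive_words for ch in set(word))
    fn = Counter(ch for word in negative_words for ch in set(word))
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    scores = {}
    for ch in set(fp) | set(fn):
        # Normalize by the total for each polarity so that an imbalance
        # between the two lists does not bias every character one way.
        p = fp[ch] / total_fp if total_fp else 0.0
        n = fn[ch] / total_fn if total_fn else 0.0
        scores[ch] = (p - n) / (p + n)
    return scores


def word_score(word, scores):
    """Average the scores of the known characters (assumption: unknown ones are skipped)."""
    known = [scores[ch] for ch in word if ch in scores]
    return sum(known) / len(known) if known else 0.0


# Toy seed lists, invented purely for illustration.
toy_scores = character_scores(["好人", "美好"], ["悪人"])
novel_word_score = word_score("美人", toy_scores)
```

Here "美人" is not in either seed list, yet it receives a score because both of its characters were seen in seed words, which is the dictionary-expansion effect described above.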

2.2. Huang, Pan and Sun

Huang, Pan and Sun (2007) use a very similar approach. As their base dictionary they use the NTCIR emotion dictionary (Kando 2008), and they also use tong2yi4ci2ci2lin2 as their thesaurus. Their approach is otherwise identical and they cite Ku et al. for this technique.

2.3. Results

Both works in this section report reasonable system results for the sentiment analysis task at the sentence or document level, yet neither paper reports results at the word or phrase level. Therefore, it is hard to establish a baseline for the performance of this phase.

3. HYPOTHESIS

The technique of Ku et al. at the word level can be considered analogous to the “bag of words” approach to sentential understanding. Much as “bag of words” ignores syntactic information encoded in the sentence, this simple approach ignores information contained in the composition of multi-character words. We theorize that system performance in Japanese will be limited by this lack of compositional knowledge.

3.1. Linguistic foundation

The Japanese language uses three different writing systems: Chinese characters and two phonetic scripts (Jorden and Noda, 1998). A significant portion of the Japanese vocabulary consists of loanwords from Chinese that are written with the same or similar characters. It is compounds of this nature that seem most amenable to the technique of Ku et al. Additionally, there are words which combine Chinese characters with script (referred to as okurigana) and words which are written entirely in script (referred to as kana). These latter words seem less relevant because, unlike Chinese characters, the individual script characters do not have independent meanings. Since Chinese does not have any equivalent script, this is a class of words on which performance in Japanese is likely to suffer compared to that of Chinese.

In Japanese there are five canonical ways in which characters can be combined to make two-character compounds (Yamamoto, 2008). These are:

1. Both characters have the same meaning.
2. The characters have opposite meanings.
3. The top character modifies the bottom character.
4. The bottom character is the target, direct object, or complement of the top character.
5. The top character negates (“flips”) the meaning of the bottom character.

The first two classes would not seem to present a problem for the “bag of words” approach, as the sentiment values will either reinforce or cancel each other out, respectively. However, the remaining three classes may present a problem, as each of them can result in a word whose meaning is not simply the sum of its parts.

3.2. Experiment

In order to evaluate the efficacy of this technique in Japanese, we formulate a simple experiment as a proof of concept. Using the Japanese sentiment dictionary of Kaji and Kitsuregawa (2007) we apply the approach of Ku et al. The Kaji and Kitsuregawa data is not intended to be a word-based dictionary and contains a significant number of bigrams and trigrams. Since these multiword entries are amenable to syntactic analysis, they are not the target of this experiment, and the first step is to clean the data and remove them. This reduces the number of usable entries from 10,000 to 2,386.

Next, we extract the unique characters and generate sentiment scores for every Chinese character in the corpus using the same approach as Ku et al. We do not generate sentiment scores for script characters for the reason cited in section 3.1. This results in a total of 954 scored characters. Finally, we generate sentiment scores for each word in the corpus using the character scores. We evaluate the results by comparing the sign of the score from the sentiment dictionary to the sign of the score by our method, ignoring intensity.

It should be stressed that this is considered at best a proof of concept in applying this technique to Japanese, and it is not intended to be a rigorous experiment. By training and testing on the same corpus we effectively simulate perfect knowledge of the sentiment scores of the words. Thus, the results should be interpreted more as an upper bound on the performance of the approach than as results for comparison to other sentiment analysis systems.
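The character-extraction and sign-comparison steps might look like the following Python sketch. It rests on several assumptions that are not stated in the text: Chinese characters are identified by the basic CJK Unified Ideographs block only, a word with no scored Chinese character is counted as unscored, and the helper names and sample values are hypothetical. The heuristic used to remove bigrams and trigrams is not specified, so it is left out.

```python
import re

# Basic CJK Unified Ideographs block. Extension blocks and the iteration
# mark "々" are ignored in this sketch.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def kanji_in(word):
    """Return the Chinese characters of a word, dropping kana and other script."""
    return KANJI.findall(word)


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def evaluate(gold, char_scores):
    """Compare signs only. gold maps word -> dictionary score.

    Returns (correct, wrong, unscored) counts.
    """
    correct = wrong = unscored = 0
    for word, gold_score in gold.items():
        known = [char_scores[c] for c in kanji_in(word) if c in char_scores]
        if not known:
            # All-script words cannot be scored from characters at all.
            unscored += 1
            continue
        predicted = sum(known) / len(known)
        if sign(predicted) == sign(gold_score):
            correct += 1
        else:
            wrong += 1
    return correct, wrong, unscored


# Invented values, for illustration only.
sample_scores = {"無": -0.5, "難": -0.8, "宝": 0.9}
sample_gold = {"無難": 1.0, "宝": 2.0, "すごい": 1.5}
counts = evaluate(sample_gold, sample_scores)
```

Keeping unscored words in a separate count makes it possible to report how much of the error is due to script-only words rather than to wrong character scores.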

4. RESULTS

4.1. Overall results

Overall results for this approach are surprisingly high, with the precision scores being fairly impressive for both negative and positive sentiment. Although we cannot draw conclusions about whether this level of performance is achievable in a real-world system, it does show rather emphatically that despite its simplicity the technique does not suffer from structural limitations that would prevent it from being effective given a reliable source for word-level sentiment data.

4.2. Detailed results for character sentiment analysis

In order to show that the system is indeed characterizing character sentiment in a manner similar to that of a human, we present the top 10 most positive and negative characters along with their most common English meaning or set of meanings. We believe the results are intuitive.

5. ERROR ANALYSIS

In this section we examine the output of the system in an attempt to identify systematic errors. Due to time constraints, it was not possible to analyze all errors. Instead, we attempt to choose a representative subset. Of the 475 errors that occurred, 100 were selected for analysis: 50 from the false positives and 50 from the false negatives. This list was further pruned to eliminate bigrams, words that contained script characters as part of the compound, and words that consisted of only a single Chinese character. The motivation for this pruning was that we are interested most in errors due to lack of compositional knowledge, and the above classes of words cannot by their nature demonstrate such errors.

After pruning we arrive at 15 false positives and 31 false negatives for analysis. We examine each word to see if a compositional explanation exists for the error. For example, for composition class 5, where the first character negates the second, we look for cases where the sentiment of the second character is opposite in sign to the true sentiment of the word.

5.1. Errors due to missing compositional knowledge

In this section, references are made to the compositional classes introduced in section 3.1.

Of the false positives, 33% were found to be explained by a lack of compositional knowledge. There were two classes of such errors. The first was composition class 5, or compounds where the first character negates (or flips) the meaning of the second character. An example is the word “無難” (safety), which is formed of the characters “not” and “difficulty, hardship”. Although both characters have negative sentiment scores, the overall score should be the reversed value of the second character. These errors represented 6.7% of the total. The second was composition class 3, specifically compounds where the first character emphasizes a positive character. An example is “重宝” (priceless), consisting of “heavy” and “treasure”. The first character is associated with a negative sentiment score, yet in this compound it serves to enhance a positive character and thus the score should be positive. These errors represented 27% of the total.

Of the false negatives there were again two classes of errors, and 54.8% of the errors can be explained by them. The first class is again composition class 5, representing 3.2% of the errors. The second class is not one of the five compositional classes introduced in section 3.1. It is a novel group making up 51.6% of the errors and consisting of words that include the character “的”. This character has no meaning in itself for the words in this class; it simply changes the part of speech of the word. Therefore, it should not influence the word’s sentiment score, but since it has a large negative value it results in many false negatives.

If the results of this sample were representative across the corpus, it would indicate that roughly 27% of all errors are due to missing compositional knowledge.
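Two of the error classes above lend themselves to simple rules, which the following Python sketch illustrates. It is a hypothetical extension, not something evaluated in this work: the set of negating characters and the set of ignored characters are assumptions made for illustration, and the emphasis case from composition class 3 is deliberately left unhandled because it needs lexical knowledge that a fixed rule cannot supply.

```python
# Hypothetical rule tables; a real system would need a curated resource.
NEGATORS = {"無", "不"}   # first characters assumed to flip the second
IGNORED = {"的"}          # part-of-speech marker treated as sentiment-neutral


def compositional_score(word, char_scores):
    """Average character scores, with a negation rule and a neutral-suffix rule."""
    # Class 5: in a two-character compound, a negating first character
    # reverses the score of the second character.
    if len(word) == 2 and word[0] in NEGATORS and word[1] in char_scores:
        return -char_scores[word[1]]

    # Otherwise fall back to the plain average, skipping neutral characters.
    known = [char_scores[c] for c in word if c not in IGNORED and c in char_scores]
    return sum(known) / len(known) if known else 0.0


# Invented scores, for illustration only.
demo_scores = {"無": -0.5, "難": -0.8, "知": 0.6, "的": -24.0}
flipped = compositional_score("無難", demo_scores)   # negation rule applies
neutral = compositional_score("知的", demo_scores)   # "的" is skipped
```

With the plain average, "無難" would score negative and "知的" would be dominated by the large negative value of "的"; the two rules reverse both outcomes in this toy setting.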

5.2. Script characters

In section 3.1 we introduced the fact that there are two major classes of words containing script characters: those entirely composed of script, and those combining both script and Chinese characters. For the former, 150/475 or 31.6% of positive sentiment errors were due to the fact that we could not develop any kind of sentiment score for a word consisting entirely of script. This class made up 201/536 or 37.5% of the negative sentiment errors. It is difficult to gauge the impact of script characters on the latter class, although it is likely nonzero. As a result, we can conclude that at least 34.7% of all errors, that is (150 + 201) / (475 + 536), and possibly more, were due to script characters.

5.3. Problems with source data

The simple methods used to prune bigrams and trigrams from the source data were not completely effective. Examination of the output shows that approximately 4-5% of the entries in the training corpus are bigrams, which could introduce noise into the data.

Finally, although for the purpose of this paper the sentiment dictionary of Kaji & Kitsuregawa was considered the gold standard, we must point out that several items in the error analysis call their sentiment scores into question. Some specific examples are 無用 and 不用, both of which mean “useless” yet puzzlingly received high positive sentiment scores and showed up as false negatives for positive sentiment in these results.

5.4. Detailed Error Analysis

The first figure below shows the full list of multi-character false positives. The column headings are, respectively, the word itself, the score as calculated by Kaji & Kitsuregawa, the score for the word via the method of this paper, and finally the scores for the individual characters in the word (Kaji & Kitsuregawa, 2007).

The second figure below repeats this analysis for the false negatives. The detailed analysis was omitted for words whose sentiment value was dominated by the single character “的”, which has a very large negative value of -24.

6. CONCLUSIONS AND FUTURE WORK

6.1. Conclusions

We hypothesized in section 3 that the technique of Ku et al. would be less effective on Japanese due to characteristics of the language. After performing the experiment, the results are not exactly what we had supposed, yet in many ways they are more interesting than what we had expected.

First, the overall results were quite good. F1 scores above 0.7, while perhaps not state of the art, are certainly not poor. Although this was unexpected, it is easier to draw conclusions from good results than from bad, and we can at least conclude that this general method does appear to hold promise.

Second, despite the surprise of the overall results, the hypothesis that there would be structural issues with Japanese was shown to be accurate. As much as 60% of the errors can be explained by the compositional rules and the system’s inability to process script characters together.

6.2. Future work

The overall results encourage further work in adapting this technique to Japanese. The first step should be to apply the technique with more rigor. A simple way to do this would be to do an annotation study of novel content and apply the character sentiment data developed in this work to it. From this, meaningful conclusions could be drawn about the efficacy of the approach. A more complex experiment would be to develop an independent Japanese sentiment dictionary and then perform the experiment; a large amount of the Kaji & Kitsuregawa data had to be discarded during cleaning (section 3.2), and it would be ideal to start from a more robust corpus.

Next, it would be interesting to experiment with compositional features. Preliminary results seem to indicate that approximately 27% of the errors could be addressed with compositional knowledge and appropriate logic. A potential barrier to this is that we are not aware of any current lexical resources that could be used to implement it.

Finally, a large percentage of the errors (approximately 35%) are due to script characters. Performance cannot be improved on these words using the techniques of this paper, since the script characters are not sentiment bearing. A hybrid approach could, however, be applied that combines character-based and word-based sentiment analysis to improve performance.
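One possible reading of such a hybrid is a simple fallback: use a word-level entry when one exists, and otherwise average the character scores. The Python sketch below illustrates that reading only; the function name, the lookup order, and the sample values are assumptions, and other combinations (for example, weighting the two sources) are equally plausible.

```python
def hybrid_score(word, word_lexicon, char_scores):
    """Prefer a word-level score; otherwise fall back to the character average.

    Returns None when neither source has any evidence for the word.
    """
    if word in word_lexicon:
        return word_lexicon[word]
    known = [char_scores[c] for c in word if c in char_scores]
    if known:
        return sum(known) / len(known)
    return None


# Invented entries, for illustration only.
lexicon = {"すごい": 1.5}                 # script-only word, scored at word level
characters = {"宝": 0.9, "物": 0.1}       # character-level scores
from_word_level = hybrid_score("すごい", lexicon, characters)
from_characters = hybrid_score("宝物", lexicon, characters)
```

Script-only words, which character scores cannot cover, are handled by the word-level lookup, while novel compounds of known characters still receive a score.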

REFERENCES

[1] Rui-hong Huang, Le Sun, and Long-xi Pan. "SCAS in Opinion Analysis Pilot Task: Experiments with sentimental dictionary based classifier and CRF model." Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, 2007.
[2] L. W. Ku, Y. T. Liang, and H. H. Chen. "Opinion extraction, summarization and tracking in news and blog corpora." Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[3] Kaji, N. and Kitsuregawa, M. "Building Lexicon for Sentiment Analysis from Massive Collection of HTML Documents." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[4] Kim, S. and Hovy, E. 2004. "Determining the sentiment of opinions." Proceedings of the 20th International Conference on Computational Linguistics (Geneva, Switzerland, August 23 - 27, 2004). International Conference On Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, 1367.
[5] Yamamoto, Natsuhiko. "漢字検定２級体験記　熟語の構成." 山本夏彦の本　つかぬことを言う. 20 October 2008.
[6] Henshall, Kenneth. A Guide to Remembering Japanese Characters. Tokyo: Tuttle Publishing, 1988.
[7] Zagibalov, T. and J. Carroll. "Automatic seed word selection for unsupervised sentiment classification of Chinese text." Proceedings of The 22nd International Conference on Computational Linguistics (COLING), Manchester, UK, 2008.
[8] Mei, J., Zhu, Y., Gao, Y. and Yin, H. tong2yi4ci2ci2lin2. Shanghai Dictionary Press. 1982.
[9] Huang, Chu-Ren, Ru-Yng Chang, and Shiang-bin Li. 2008. Sinica BOW: A bilingual ontological wordnet. To appear in: Chu-Ren Huang et al. Eds. Ontologies and Lexical Resources for Natural Language Processing. Cambridge Studies in Natural Language Processing. Cambridge: Cambridge University Press.
[10] Stone, Philip J., Smith, Marshall S., Ogilvie, Daniel M., and Dunphy, Dexter C. The General Inquirer: A Computer Approach to Content Analysis. MIT Press. 1966.
[11] Kando, Noriko. "Evaluation of Information Access Technologies: Information Retrieval, Question Answering, and Cross-Lingual Information Access." The 6th NTCIR Workshop (2006/2007). 01 December 2008.
[12] Jorden, Eleanor Harz and Noda, Mari. Japanese: The Written Language. Boston: Cheng & Tsui Company. 1998.
