Distributional Information
Modeling & clustering a vector space on a 2-million-word corpus of Spanish with an unsupervised algorithm
Fernando Balbachan & Damir Ćavar
Distributional Information
"One's ability to produce and recognize grammatical utterances is not based on notions of statistical approximation and the like." (Chomsky 1957: 16)
"Given the grammar of a language, one can study the use of the language statistically in various ways; and the development of probabilistic models for the use of language (as distinct from the syntactic structure of the language) can be quite rewarding." (Chomsky 1957: 17, note 4)
Distributional Information
original research short-term goals
To demonstrate empirically that distributional information provides a powerful cue to syntactic category membership (lexical type information)
To propose a draft model for machine learning by clustering of tokens from large corpora
Distributional Information
Paradigms in parsing
- Connectionist models: psycholinguistic evidence in humans (incremental parsing)
- Symbolic parsers: exhaustive parsing with an aprioristic system of rules
- Statistical approach: induction of syntactic rules from a tabula rasa
Distributional Information
Inspiration for the statistical approach: Zipf's laws (1930s), handcrafted counting of words
[Chart: function words vs. content words (very low frequency), comparing their share of word tokens in corpora and their mapping onto dictionaries; figures shown: 50%, 2, 45%, 5, 98%]
Distributional Information
Inspiration for the statistical approach: Zipf's laws (1930s), verified on the Brown Corpus (prototype in Scheme)
Half of the journalistic reportage section of the Brown Corpus (untagged)
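A minimal Python sketch of the same rank-frequency check (the original prototype was written in Scheme; the filename below is illustrative):

```python
# Rank-frequency check of Zipf's law on an untagged plain-text corpus file.
# The filename is hypothetical; this is a sketch, not the original prototype.
import re
from collections import Counter

with open("brown_reportage.txt", encoding="utf-8") as f:
    tokens = re.findall(r"\w+", f.read().lower())

ranked = Counter(tokens).most_common()

# Under Zipf's law, frequency * rank stays roughly constant across ranks.
for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>4}  {word:<15}  freq={freq:<6}  freq*rank={freq * rank}")
```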
Distributional Information
original research - stages
1st stage: corpus handling
- A statistically balanced corpus of around 2 million tokens, which avoids frequency or topic bias
- Treatment of punctuation marks
- Powerful text editors compliant with regular expressions and UTF-8 (JEdit)
2nd stage: research for the prototype
- Choice of programming language: Python, with a library of clustering functions (Pycluster) http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
3rd stage: design of the algorithm, following 3 guidelines
- Measuring the distribution of contexts (left and right) within which each word occurs
- Comparing the distributions of contexts for pairs of words (bigrams)
- Grouping together words with similar distributions of contexts
Distributional Information
design of algorithm - stages
1st stage: cue identification via the Decreasing Frequency Profile
2nd stage: Mutual Information to the left and to the right
3rd stage: vector space
4th stage: clustering with an improved K-means algorithm
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
Decreasing Frequency Profile: rank all the types by their frequency. Create a bi-gram model (as follows…)
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
1) Create a bi-gram list (i.e. the left and right occurrences of each token, or each token's left and right context)
2) Identify cues: create two bags of words:
   a. high-frequency context words
   b. all words in the context of (a) (taken from the bi-grams where the token occurs)
   Beginning with the highest-ranked token in the DFP:
   - add the token to (a)
   - add all words in its context to (b)
   - stop if no new word is added to (b)
   This usually ends with some 100 tokens in (a) and approx. 90% of all tokens in (b).
3) Generate the vector space:
   - all words from (a) are columns, all words from (b) are rows
   - the first half of each vector counts a word of (b) occurring left of the words in (a)
   - the second half counts a word of (b) occurring right of the words in (a)
   - the cell values are the co-occurrence frequencies
A sketch of these three steps follows below.
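A minimal Python sketch of the three DFP steps, assuming a tokenized corpus; the function and variable names are illustrative, not the original implementation:

```python
# Sketch of DFP cue selection and the left/right co-occurrence vector space.
# `tokens` is the corpus as a list of word tokens; names are illustrative.
from collections import Counter, defaultdict

def build_vector_space(tokens):
    # 1) Bi-gram model: frequency of every (left, right) token pair.
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Decreasing Frequency Profile: all types ranked by corpus frequency.
    dfp = [w for w, _ in Counter(tokens).most_common()]

    # Neighbourhood (left and right context words) of every token.
    neighbours = defaultdict(set)
    for left, right in bigrams:
        neighbours[left].add(right)
        neighbours[right].add(left)

    # 2) Cue identification: walk down the DFP, adding each token to the
    #    cue bag (a) and its context words to bag (b); stop when (b) stops growing.
    cues, context_words = [], set()
    for word in dfp:
        cues.append(word)                      # add the token to (a)
        new = neighbours[word] - context_words
        context_words |= new                   # add its context words to (b)
        if not new:
            break                              # no new word was added to (b)

    # 3) Vector space: one row per word in (b); the first half of the columns
    #    counts the row word occurring LEFT of each cue, the second half
    #    counts it occurring RIGHT of each cue (cells = co-occurrence freqs).
    space = {}
    for w in context_words:
        left_half = [bigrams.get((w, c), 0) for c in cues]    # w before cue
        right_half = [bigrams.get((c, w), 0) for c in cues]   # w after cue
        space[w] = left_half + right_half
    return cues, space
```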
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
This is a remarkable improvement over previous clustering attempts (Redington et al., 1998), where the cue list is cut off arbitrarily in the ranking profile. This heuristic is also useful for classifying function words vs. content words, even in unknown languages.
Distributional Information
design of algorithm
Mutual Information (MI)
Given the massive scope of the corpus, the cue list against the context words results in a huge matrix. It is therefore important to narrow down the statistical weight of the cues and their concordances by calculating the average Mutual Information MI (a.k.a. informativeness) for each cue:

MI(x, y) = P(xy) * log2( P(xy) / (P(x) * P(y)) )

For function words in a left-headed grammar, MI to the right is expected to be significantly greater than MI to the left. Therefore, the matrix for the vector space will only be half-populated.
Distributional Information
design of algorithm
Mutual Information (MI)
- For a token X, calculate MI for all bi-grams where it occurs on the left, take the average, and treat this as the rightward MI of X.
- For a token X, calculate MI for all bi-grams where it occurs on the right, take the average, and treat this as the leftward MI of X.
- Delete from the vector space the column where X has the lower MI score (if the leftward MI is lower, delete that column, etc.).
- This reduces the matrix size by approx. 50% (equal average leftward and rightward MI is a rare case).
A sketch of this pruning step follows below.
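A hedged sketch of this pruning step, reusing the names from the previous sketch (illustrative only, not the original code):

```python
# MI-based pruning: for each cue, average MI over bigrams with the cue on the
# left ("rightward" MI) and on the right ("leftward" MI), then keep only the
# half of each row vector with the higher average, halving the matrix.
from collections import Counter
from math import log2

def pointwise_mi(f_xy, f_x, f_y, n_bigrams, n_tokens):
    # MI(x, y) = P(xy) * log2( P(xy) / (P(x) * P(y)) )
    p_xy = f_xy / n_bigrams
    return p_xy * log2(p_xy / ((f_x / n_tokens) * (f_y / n_tokens)))

def prune_by_mi(tokens, cues, space):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tok, n_big = len(tokens), len(tokens) - 1
    average = lambda xs: sum(xs) / len(xs) if xs else 0.0

    keep_right = {}   # True: keep the "word right of cue" half of the vector
    for cue in cues:
        rightward = [pointwise_mi(f, unigrams[x], unigrams[y], n_big, n_tok)
                     for (x, y), f in bigrams.items() if x == cue]
        leftward = [pointwise_mi(f, unigrams[x], unigrams[y], n_big, n_tok)
                    for (x, y), f in bigrams.items() if y == cue]
        keep_right[cue] = average(rightward) >= average(leftward)

    # Each row is [counts of w LEFT of each cue | counts of w RIGHT of each
    # cue]; for every cue, drop the half with the lower average MI.
    half = len(cues)
    return {w: [vec[half + i] if keep_right[cues[i]] else vec[i]
                for i in range(half)]
            for w, vec in space.items()}
```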
Distributional Information
design of algorithm
vector space
Implemented as a matrix with the relative frequencies of the bigrams CUE-CONTEXT (more frequent, due to MI in the 2nd stage) and CONTEXT-CUE (less frequent, due to MI in the 2nd stage).
[Matrix diagram: cues W1, W2, …, W107 as columns; context words W108, W109, …, W71,460 as rows]
Distributional Information
design of algorithm
Clustering with the K-means algorithm
One improvement over the standard K-means algorithm is to start clustering from the most distant vectors.
K-max = number of clusters = number of cues
The number of clusters is increased from 2 up to K-max (see the sketch after the diagram below).
[Diagram: sample iteration. First cycle with K = 2, clusters 0 and 1: centroid assignment, Euclidean-distance error calculation, recomputation of means.]
New cycle with K = 3; comparison of errors (reassignment is tried); error_cluster = Σ (V1 … Vn) Euclidean distance to the centroid
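Below is a minimal NumPy sketch of this refinement (farthest-vector seeding and one clustering cycle per value of K); it illustrates the idea and is not the original frequency3.py / Pycluster code:

```python
# K-means with farthest-vector seeding; K grows from 2 towards K-max.
import numpy as np

def farthest_first_seeds(data, k):
    # First seed: the vector farthest from the global mean; then repeatedly
    # pick the vector farthest from all seeds chosen so far.
    seeds = [int(np.argmax(np.linalg.norm(data - data.mean(axis=0), axis=1)))]
    while len(seeds) < k:
        dists = np.min([np.linalg.norm(data - data[s], axis=1) for s in seeds],
                       axis=0)
        seeds.append(int(np.argmax(dists)))
    return data[seeds].astype(float).copy()

def kmeans(data, k, iterations=100):
    centroids = farthest_first_seeds(data, k)
    for _ in range(iterations):
        # Assignment: each vector joins the cluster of its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recomputation of means: each centroid becomes its cluster's mean.
        updated = np.array([data[labels == j].mean(axis=0)
                            if np.any(labels == j) else centroids[j]
                            for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Error of the cycle: summed Euclidean distance of vectors to centroids.
    error = float(np.linalg.norm(data - centroids[labels], axis=1).sum())
    return labels, error

def cluster_vector_space(data, k_max):
    # One clustering cycle per K, from 2 up to K-max; the errors are what
    # the log compares across cycles.
    # e.g. data = np.array(list(pruned_space.values()), dtype=float)
    return {k: kmeans(data, k) for k in range(2, k_max + 1)}
```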
Distributional Information
Clustering working !!!!!
Script: frequency3.py
Parameters: es-00001.txt > spanish.log
How to run the experiment:
1) Install Python 2.4 on Windows
2) Install the 4 clustering libraries hardcoded in the script for Python on Windows (Pycluster, numarray, NumPy, Numeric)
3) Put the script frequency3.py in the same folder as Kmeans.py and the corpus file (.txt)
4) Run the script frequency3.py with the parameters (file_name = corpus name)
5) Wait at least 4 days :)
6) This generates a file called spanish.log as output
Distributional Information
killing the time & power of calculation
My laptop blew up (ran out of memory) running the improved script over the huge corpus; it takes 7 days on servers. Even so, the generated log is 23 MB (6,835 pages).
Therefore, it is important to work with a summary of the log
The final attempt shows a summary with 35 relatively high-purity clusters.
Purity (a.k.a. accuracy) = hits / (hits + false positives)
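As a small illustration, a hypothetical helper (not part of the original scripts) computing this score over one cluster's hand-checked labels:

```python
# Purity of a single cluster: the share of members carrying the majority
# gold category; all other members count as false positives.
from collections import Counter

def purity(gold_labels):
    """gold_labels: hand-assigned category of every token in one cluster."""
    hits = Counter(gold_labels).most_common(1)[0][1]
    false_positives = len(gold_labels) - hits
    return hits / (hits + false_positives)

# Example: a 24-member cluster with 12 pronouns and 12 intruders -> 0.5
print(purity(["pron"] * 12 + ["other"] * 12))
```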
Distributional Information
analysis of log - outcome
Some facts:
1) All major syntactic categories have been accurately clustered (nouns by gender and number, conjugated verbs, etc.) with purity greater than 90%. However, there was some dispersion (3 clusters for the same major category) (#19-18-0 and #11-1-14).
2) Some other clusters are remarkably highly refined: verb participles (#7 = 100%), demonstrative and possessive pronouns (#6 = 50%), family ties (!) (#23 = 75%, even though they differ in gender), verbs of speech attitude (!) (#5 = 100%), proper names of countries.
3) Yet there are still 2 junk clusters (one with thousands of types) and some low-purity clusters: low completeness due to misses.
Distributional Information
analysis of log - outcome

Major syntactic category              | Clusters #     | n =              | Purity
Singular masculine noun               | 1, 11, 32      | 291, 97, 22      | 99%, 100%, 100%
Plural masculine noun                 | 9, 20, 30      | 44, 8, 266       | 100%, 100%, 100%
Singular feminine noun                | 0, 18, 19, 29  | 14, 166, 531, 60 | 93%, 100%, 100%, 100%
Plural feminine noun                  | 21, 28         | 150, 20          | 100%, 100%
Conjugated verbs                      | 8              | 574              | 98%
Infinitive form verbs                 | 17             | 82               | 25%
Conjunctions                          | 13             | 7                | 60%
Adverbs                               | 2              | 17               | 50%
Demonstrative and possessive pronouns | 6              | 24               | 50%
Adjectives                            | 16             | 905              | 50%
Prepositions                          | 12             | 33               | 25%
Distributional Information
analysis of log - outcome
[Bar chart: purity (%) of clusters by major syntactic category: nouns (n=1,669), verbs (n=574), adjectives (n=1,005), pronouns (n=24), conjunctions (n=7), adverbs (n=17), prepositions (n=33)]
One major problem: in Spanish, adjectives may act as nouns, which yields similar vectorial behavior with respect to cues such as articles and determiners.
Distributional Information
A humble comparison…

                      | Redington et al.                            | Balbachan & Ćavar                                                  | Observations
Corpus size (tokens)  | 1,500,000                                   | 2,000,000                                                          |
Types for analysis    | 1,150                                       | 71,467                                                             |
Cue list              | 150                                         | 107                                                                | In Redington (benchmark or context) it is arbitrary; in ours it is given by the DFP
Context list          | 1,000                                       | 71,360                                                             | In Redington, target words
Dimensions of vector  | 150 x 4 = 600                               | 107 x 2 = 214                                                      |
Matrix (vector space) | 600 x 1,000 = 600,000                       | 107 (due to MI) x 71,360 = 7,635,520                               | 12.5 times more calculations in our project
Clustering technique  | Hierarchical average link                   | K-means                                                            | In Redington, a dendrogram arbitrarily cut at the 0.8 level; in ours, optimization of the means
Mutual Information    | Bigrams and trigrams to the left and right | Only bigrams, either to the left or to the right (according to MI) |
Distributional Information
original research long-term goals
- To take advantage of the algorithm in order to induce highly refined syntactic rules
- Tagging cues according to the MULTEXT nomenclature -> the problem of rich morphology
- Category guesser (Euclidean distance of cues, applied to an unknown word with a small context, with respect to the vector centroids of the major categories)
- Semantic bootstrapping: evidence that the method captures some semantic relationships
- To test the results cross-linguistically (currently working with Prof. Ćavar on Croatian, German & English)
Distributional Information
bibliography
Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (Massachusetts).
Redington, M., Chater, N. & Finch, S. (1998). Distributional Information: A Powerful Cue for Acquiring Syntactic Categories. Cognitive Science 22(4), pp. 425-469.
Elghamry, K. & Ćavar, D. (2004). Bootstrapping Cues for Cue-based Bootstrapping. Ms., Indiana University.
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal XXVII, pp. 379-423.
Ćavar, D., Rodrigues, P. & Schrementi, G. (2004). Syntactic Parsing Using Mutual Information and Relative Entropy. To appear in IULC Online Working Papers, Proceedings of the Midwest Computational Linguistics Colloquium (MCLC) 2004.