Distributional Information
Modeling & clustering a vector space on a 2-million-word corpus of Spanish with an unsupervised algorithm
Fernando Balbachan & Damir Ćavar
Distributional Information
"One's ability to produce and recognize grammatical utterances is not based on notions of statistical approximation and the like." (Chomsky 1957: 16)
"Given the grammar of a language, one can study the use of the language statistically in various ways; and the development of probabilistic models for the use of language (as distinct from the syntactic structure of the language) can be quite rewarding." (Chomsky 1957: 17, note 4)
Distributional Information
original research short-term goals
To demonstrate empirically that distributional information provides a powerful cue to syntactic category membership (lexical type information)
To propose a draft model for machine learning by clustering of tokens from large corpora
Distributional Information
Paradigms in parsing
- Connectionist models: psycholinguistic evidence in humans (incremental parsing)
- Symbolic parsers: exhaustive parsing with an aprioristic system of rules
- Statistical approach: induction of syntactic rules from a tabula rasa
Distributional Information
Inspiration for the statistical approach: Zipf's laws (1930s), handcrafted counting of words
[Chart: function words vs. content words (very low frequency), comparing their share of word tokens in corpora and their mapping onto dictionaries; figures shown: 50%, 2, 45%, 5, 98%]
Distributional Information
Inspiration for the statistical approach: Zipf's laws (1930s), verified on the Brown Corpus (prototype in Scheme)
Half of the journalistic reportage section of the Brown Corpus (untagged)
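A minimal Python sketch of the same rank-frequency check (the original prototype was written in Scheme; the filename below is illustrative):

```python
# Rank-frequency check of Zipf's law on an untagged plain-text corpus file.
# The filename is hypothetical; this is a sketch, not the original prototype.
import re
from collections import Counter

with open("brown_reportage.txt", encoding="utf-8") as f:
    tokens = re.findall(r"\w+", f.read().lower())

ranked = Counter(tokens).most_common()

# Under Zipf's law, frequency * rank stays roughly constant across ranks.
for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>4}  {word:<15}  freq={freq:<6}  freq*rank={freq * rank}")
```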
Distributional Information
original research - stages
1st stage: corpus handling
- A statistically balanced corpus of around 2 million tokens, which avoids frequency or topic bias
- Treatment of punctuation marks
- Powerful text editors compliant with regular expressions and UTF-8 (JEdit)
2nd stage: research for the prototype
- Choice of programming language: Python, with a library of clustering functions (Pycluster) http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
3rd stage: design of the algorithm, following 3 guidelines
- Measuring the distribution of contexts (left and right) within which each word occurs
- Comparing the distributions of contexts for pairs of words (bigrams)
- Grouping together words with similar distributions of contexts
Distributional Information
design of algorithm - stages
1st stage: cue identification via the Decreasing Frequency Profile
2nd stage: Mutual Information to the left and to the right
3rd stage: vector space
4th stage: clustering with an improved K-means algorithm
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
Decreasing Frequency Profile: rank all the types by their frequency. Create a bi-gram model (as follows…)
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
1) Create a bi-gram list (i.e. the left and right occurrences of each token, or each token's left and right context)
2) Identify cues: create two bags of words:
   a. high-frequency context words
   b. all words in the context of (a) (taken from the bi-grams where the token occurs)
   Beginning with the highest-ranked token in the DFP:
   - add the token to (a)
   - add all words in its context to (b)
   - stop if no new word is added to (b)
   This usually ends with some 100 tokens in (a) and approx. 90% of all tokens in (b).
3) Generate the vector space:
   - all words from (a) are columns, all words from (b) are rows
   - the first half of each vector counts a word of (b) occurring left of the words in (a)
   - the second half counts a word of (b) occurring right of the words in (a)
   - the cell values are the co-occurrence frequencies
A sketch of these three steps follows below.
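A minimal Python sketch of the three DFP steps, assuming a tokenized corpus; the function and variable names are illustrative, not the original implementation:

```python
# Sketch of DFP cue selection and the left/right co-occurrence vector space.
# `tokens` is the corpus as a list of word tokens; names are illustrative.
from collections import Counter, defaultdict

def build_vector_space(tokens):
    # 1) Bi-gram model: frequency of every (left, right) token pair.
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Decreasing Frequency Profile: all types ranked by corpus frequency.
    dfp = [w for w, _ in Counter(tokens).most_common()]

    # Neighbourhood (left and right context words) of every token.
    neighbours = defaultdict(set)
    for left, right in bigrams:
        neighbours[left].add(right)
        neighbours[right].add(left)

    # 2) Cue identification: walk down the DFP, adding each token to the
    #    cue bag (a) and its context words to bag (b); stop when (b) stops growing.
    cues, context_words = [], set()
    for word in dfp:
        cues.append(word)                      # add the token to (a)
        new = neighbours[word] - context_words
        context_words |= new                   # add its context words to (b)
        if not new:
            break                              # no new word was added to (b)

    # 3) Vector space: one row per word in (b); the first half of the columns
    #    counts the row word occurring LEFT of each cue, the second half
    #    counts it occurring RIGHT of each cue (cells = co-occurrence freqs).
    space = {}
    for w in context_words:
        left_half = [bigrams.get((w, c), 0) for c in cues]    # w before cue
        right_half = [bigrams.get((c, w), 0) for c in cues]   # w after cue
        space[w] = left_half + right_half
    return cues, space
```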
Distributional Information
design of algorithm
Decreasing Frequency Profile (DFP) (Ćavar, 2004)
This is a remarkable improvement over previous clustering attempts (Redington et al., 1998), where the cue list is cut off arbitrarily in the ranking profile. This heuristic is also useful for classifying function words vs. content words, even in unknown languages.
Distributional Information
design of algorithm
Mutual Information (MI)
Given the massive scope of the corpus, the cue list against the context words results in a huge matrix. It is therefore important to narrow down the statistical weight of the cues and their concordances by calculating the average Mutual Information MI (a.k.a. informativeness) for each cue:

MI(x, y) = P(xy) * log2( P(xy) / (P(x) * P(y)) )

For function words in a left-headed grammar, MI to the right is expected to be significantly greater than MI to the left. Therefore, the matrix for the vector space will only be half-populated.
Distributional Information
design of algorithm
Mutual Information (MI)
- For a token X, calculate MI for all bi-grams where it occurs on the left, take the average, and treat this as the rightward MI of X.
- For a token X, calculate MI for all bi-grams where it occurs on the right, take the average, and treat this as the leftward MI of X.
- Delete from the vector space the column where X has the lower MI score (if the leftward MI is lower, delete that column, etc.).
- This reduces the matrix size by approx. 50% (equal average leftward and rightward MI is a rare case).
A sketch of this pruning step follows below.
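A hedged sketch of this pruning step, reusing the names from the previous sketch (illustrative only, not the original code):

```python
# MI-based pruning: for each cue, average MI over bigrams with the cue on the
# left ("rightward" MI) and on the right ("leftward" MI), then keep only the
# half of each row vector with the higher average, halving the matrix.
from collections import Counter
from math import log2

def pointwise_mi(f_xy, f_x, f_y, n_bigrams, n_tokens):
    # MI(x, y) = P(xy) * log2( P(xy) / (P(x) * P(y)) )
    p_xy = f_xy / n_bigrams
    return p_xy * log2(p_xy / ((f_x / n_tokens) * (f_y / n_tokens)))

def prune_by_mi(tokens, cues, space):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tok, n_big = len(tokens), len(tokens) - 1
    average = lambda xs: sum(xs) / len(xs) if xs else 0.0

    keep_right = {}   # True: keep the "word right of cue" half of the vector
    for cue in cues:
        rightward = [pointwise_mi(f, unigrams[x], unigrams[y], n_big, n_tok)
                     for (x, y), f in bigrams.items() if x == cue]
        leftward = [pointwise_mi(f, unigrams[x], unigrams[y], n_big, n_tok)
                    for (x, y), f in bigrams.items() if y == cue]
        keep_right[cue] = average(rightward) >= average(leftward)

    # Each row is [counts of w LEFT of each cue | counts of w RIGHT of each
    # cue]; for every cue, drop the half with the lower average MI.
    half = len(cues)
    return {w: [vec[half + i] if keep_right[cues[i]] else vec[i]
                for i in range(half)]
            for w, vec in space.items()}
```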
Distributional Information
design of algorithm
vector space
Implemented as a matrix with the relative frequencies of the bigrams CUE-CONTEXT (more frequent, due to MI in the 2nd stage) and CONTEXT-CUE (less frequent, due to MI in the 2nd stage).
[Matrix diagram: cues W1, W2, …, W107 as columns; context words W108, W109, …, W71,460 as rows]
Distributional Information
design of algorithm
Clustering with the K-means algorithm
One improvement over the standard K-means algorithm is to start clustering from the most distant vectors.
K-max = number of clusters = number of cues
The number of clusters is increased from 2 up to K-max (see the sketch after the diagram below).
[Diagram: sample iteration. First cycle with K = 2, clusters 0 and 1: centroid assignment, Euclidean-distance error calculation, recomputation of means.]
New cycle with K = 3; comparison of errors (reassignment is tried); error_cluster = Σ (V1 … Vn) Euclidean distance to the centroid
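Below is a minimal NumPy sketch of this refinement (farthest-vector seeding and one clustering cycle per value of K); it illustrates the idea and is not the original frequency3.py / Pycluster code:

```python
# K-means with farthest-vector seeding; K grows from 2 towards K-max.
import numpy as np

def farthest_first_seeds(data, k):
    # First seed: the vector farthest from the global mean; then repeatedly
    # pick the vector farthest from all seeds chosen so far.
    seeds = [int(np.argmax(np.linalg.norm(data - data.mean(axis=0), axis=1)))]
    while len(seeds) < k:
        dists = np.min([np.linalg.norm(data - data[s], axis=1) for s in seeds],
                       axis=0)
        seeds.append(int(np.argmax(dists)))
    return data[seeds].astype(float).copy()

def kmeans(data, k, iterations=100):
    centroids = farthest_first_seeds(data, k)
    for _ in range(iterations):
        # Assignment: each vector joins the cluster of its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recomputation of means: each centroid becomes its cluster's mean.
        updated = np.array([data[labels == j].mean(axis=0)
                            if np.any(labels == j) else centroids[j]
                            for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Error of the cycle: summed Euclidean distance of vectors to centroids.
    error = float(np.linalg.norm(data - centroids[labels], axis=1).sum())
    return labels, error

def cluster_vector_space(data, k_max):
    # One clustering cycle per K, from 2 up to K-max; the errors are what
    # the log compares across cycles.
    # e.g. data = np.array(list(pruned_space.values()), dtype=float)
    return {k: kmeans(data, k) for k in range(2, k_max + 1)}
```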
Distributional Information
Clustering working !!!!!
Script: frequency3.py
Parameters: es-00001.txt > spanish.log
How to run the experiment:
1) Install Python 2.4 on Windows
2) Install the 4 clustering libraries hardcoded in the script for Python on Windows (Pycluster, numarray, NumPy, Numeric)
3) Put the script frequency3.py in the same folder as Kmeans.py and the corpus file (.txt)
4) Run the script frequency3.py with the parameters (file_name = corpus name)
5) Wait at least 4 days :)
6) This generates a file called spanish.log as output
Distributional Information
killing the time & power of calculation
My laptop blew up (ran out of memory) running the improved script over the huge corpus; it takes 7 days on servers. Even so, the generated log is 23 MB (6,835 pages).
Therefore, it is important to work with a summary of the log
The final attempt shows a summary with 35 relatively high-purity clusters.
Purity (a.k.a. accuracy) = hits / (hits + false positives)
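As a small illustration, a hypothetical helper (not part of the original scripts) computing this score over one cluster's hand-checked labels:

```python
# Purity of a single cluster: the share of members carrying the majority
# gold category; all other members count as false positives.
from collections import Counter

def purity(gold_labels):
    """gold_labels: hand-assigned category of every token in one cluster."""
    hits = Counter(gold_labels).most_common(1)[0][1]
    false_positives = len(gold_labels) - hits
    return hits / (hits + false_positives)

# Example: a 24-member cluster with 12 pronouns and 12 intruders -> 0.5
print(purity(["pron"] * 12 + ["other"] * 12))
```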
Distributional Information
analysis of log - outcome
Some facts:
1) All major syntactic categories have been accurately clustered (nouns by gender and number, conjugated verbs, etc.) with purity greater than 90%. However, there was some dispersion (3 clusters for the same major category) (#19-18-0 and #11-1-14).
2) Some other clusters are remarkably highly refined: verb participles (#7 = 100%), demonstrative and possessive pronouns (#6 = 50%), family ties (!) (#23 = 75%, even though they differ in gender), verbs of speech attitude (!) (#5 = 100%), proper names of countries.
3) Yet there are still 2 junk clusters (one with thousands of types) and some low-purity clusters: low completeness due to misses.
Distributional Information
analysis of log - outcome

Major syntactic category              | Clusters #     | n =              | Purity
Singular masculine noun               | 1, 11, 32      | 291, 97, 22      | 99%, 100%, 100%
Plural masculine noun                 | 9, 20, 30      | 44, 8, 266       | 100%, 100%, 100%
Singular feminine noun                | 0, 18, 19, 29  | 14, 166, 531, 60 | 93%, 100%, 100%, 100%
Plural feminine noun                  | 21, 28         | 150, 20          | 100%, 100%
Conjugated verbs                      | 8              | 574              | 98%
Infinitive form verbs                 | 17             | 82               | 25%
Conjunctions                          | 13             | 7                | 60%
Adverbs                               | 2              | 17               | 50%
Demonstrative and possessive pronouns | 6              | 24               | 50%
Adjectives                            | 16             | 905              | 50%
Prepositions                          | 12             | 33               | 25%
Distributional Information
analysis of log - outcome
[Bar chart: purity (%) of clusters by major syntactic category: nouns (n=1,669), verbs (n=574), adjectives (n=1,005), pronouns (n=24), conjunctions (n=7), adverbs (n=17), prepositions (n=33)]
One major problem: in Spanish, adjectives may act as nouns, which yields similar vectorial behavior with respect to cues such as articles and determiners.
Distributional Information
A humble comparison…

                      | Redington et al.                            | Balbachan & Ćavar                                                  | Observations
Corpus size (tokens)  | 1,500,000                                   | 2,000,000                                                          |
Types for analysis    | 1,150                                       | 71,467                                                             |
Cue list              | 150                                         | 107                                                                | In Redington (benchmark or context) it is arbitrary; in ours it is given by the DFP
Context list          | 1,000                                       | 71,360                                                             | In Redington, target words
Dimensions of vector  | 150 x 4 = 600                               | 107 x 2 = 214                                                      |
Matrix (vector space) | 600 x 1,000 = 600,000                       | 107 (due to MI) x 71,360 = 7,635,520                               | 12.5 times more calculations in our project
Clustering technique  | Hierarchical average link                   | K-means                                                            | In Redington, a dendrogram arbitrarily cut at the 0.8 level; in ours, optimization of the means
Mutual Information    | Bigrams and trigrams to the left and right | Only bigrams, either to the left or to the right (according to MI) |
Distributional Information
original research long-term goals
- To take advantage of the algorithm in order to induce highly refined syntactic rules
- Tagging cues according to the MULTEXT nomenclature -> the problem of rich morphology
- Category guesser (Euclidean distance of cues, applied to an unknown word with a small context, with respect to the vector centroids of the major categories)
- Semantic bootstrapping: evidence that the method captures some semantic relationships
- To test the results cross-linguistically (currently working with Prof. Ćavar on Croatian, German & English)
Distributional Information
bibliography
Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (Massachusetts).
Redington, M., Chater, N. & Finch, S. (1998). Distributional Information: A Powerful Cue for Acquiring Syntactic Categories. Cognitive Science 22(4), pp. 425-469.
Elghamry, K. & Ćavar, D. (2004). Bootstrapping Cues for Cue-based Bootstrapping. Ms., Indiana University.
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal XXVII, pp. 379-423.
Ćavar, D., Rodrigues, P. & Schrementi, G. (2004). Syntactic Parsing Using Mutual Information and Relative Entropy. To appear in IULC Online Working Papers, Proceedings of the Midwest Computational Linguistics Colloquium (MCLC) 2004.