Thesis topic definition paper
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
BACHELOR’S THESIS: Email clustering algorithm
Author: Libor Grafnetr
Supervisor: Ing. Pavel Kordík, Ph.D.
Year: 2009
Acknowledgments

I would like to thank my supervisor Ing. Pavel Kordík, Ph.D. for his guidance throughout the work on this thesis.
Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used. I have no objection to usage of this work in compliance with the act §60 Zákona č. 121/2000 Sb. (copyright law), and with the rights connected with the copyright act including the changes in the act.
In Prague on June 9, 2009
.............................................................
Abstract

This thesis describes the analysis, design and implementation of a clustering algorithm that operates on email messages. Its aim is to recognize messages that deal with a similar topic and aggregate them in one cluster. The algorithm makes use of general document clustering methods combined with analysis of data specific to email messages. Given the nature of email, it supports online clustering. The functionality will be integrated into the existing email client software "eM Client". Implementation of suitable data structures with a database backend, mail client interoperability code and a user interface is also part of this work.
Abstrakt

Tato práce popisuje analýzu, návrh a implementaci shlukovacího algoritmu, který bude pracovat s emailovými zprávami. Jeho cílem je rozpoznat zprávy, které se týkají stejného tématu, a sdružit je do jednoho shluku. Algoritmus využívá obecných postupů shlukování dokumentů zkombinovaných s analýzou informací specifických pro emailové zprávy. Vzhledem k povaze emailu algoritmus podporuje inkrementální shlukování. Shlukovací funkcionalita bude integrována do existujícího emailového klienta "eM Client". Součástí práce je také implementace vhodných datových struktur s napojením na databázi, kódu pro spolupráci s mailovým klientem a uživatelského rozhraní.
Contents

1 Introduction
  1.1 Motivation
2 Problem description
  2.1 Project breakdown
  2.2 Existing approaches
  2.3 Thesis structure
3 Analysis
  3.1 Project components
    3.1.1 Clustering framework
    3.1.2 Distance function and text analysis
    3.1.3 Data structures
    3.1.4 Interoperability code
    3.1.5 User interface
  3.2 Clustering
    3.2.1 Algorithms
      3.2.1.1 K-means
      3.2.1.2 Hierarchical clustering
      3.2.1.3 Fuzzy C-means
      3.2.1.4 Conclusion
    3.2.2 Data instances representation
    3.2.3 Measuring distance
    3.2.4 Email messages
    3.2.5 Document clustering
      3.2.5.1 Tokenization
      3.2.5.2 Term ranking and TF-IDF
      3.2.5.3 Cosine measure
    3.2.6 eM Client
4 Design
  4.1 Hierarchical agglomerative clustering
  4.2 Email messages
  4.3 Data instance features
  4.4 Text analysis
  4.5 Distance function
  4.6 Database backend
  4.7 User interface
  4.8 Integration layer
5 Implementation
  5.1 Clustering
  5.2 Mail clustering
  5.3 Storage
  5.4 User interface
  5.5 Integration layer
6 Testing
  6.1 Clustering quality
    6.1.1 Subjective evaluation
    6.1.2 Objective evaluation
  6.2 Data structures optimization
  6.3 Performance
7 Conclusion
  7.1 Future work
Bibliography
A SQL Schema
B Clustering Illustrations
C CD directory structure
Chapter 1

Introduction

1.1 Motivation
Email has become one of the most important means of communication among people in both personal and corporate environments. Its unprecedented features have allowed people to communicate and cooperate in their activities like no other technology before. Since its notable spread, the volume of email messages has risen by several orders of magnitude, and email now represents a core component of every corporation's workflow. However, the way software presents email messages to people has hardly changed. While a simple list of emails may be sufficient for a low volume of mostly independent mails, today's communication frequently consists of interconnected messages, bound by a similar topic, possibly from several different individuals.

The need to change how people work with their email has been known for many years; the problem is entering general consciousness, and public media are focusing their attention on it more often. There are two basic (but interconnected) parts of the issue: the first is the immense volume of emails that many users are subject to - this problem has been dubbed "Email Overload"; the second is most users' inability to use email efficiently, e.g. sending unnecessary messages or overusing the reply-all functionality. There have even been attempts to quantify the costs of inefficient email usage, with estimates as high as $650 billion [1].

Automatically organizing the user's messages would be a significant remedy to the problem. It would allow users to distribute their time more efficiently to relevant tasks, speed up information look-up and generally create a more pleasant email client environment, in which many people spend a significant part of their work time.
Chapter 2

Problem description

The aim of this work is to create an email client extension that automatically clusters the user's email messages and presents the result. The clustering process must be incremental, as new messages continually arrive in the user's inbox, and the framework data necessary to perform the clustering must be persistent between application executions. The resulting data, consisting of clusters and the emails they contain, must be presented to the user in a comprehensible and non-intrusive way from within the email client.
2.1 Project breakdown
To achieve the above, we have divided the project into the following tasks:

• Devise a variation on an existing clustering algorithm that suits the specific needs of the clustering task.

• Analyze the nature of email messages and find relevant data mining features that will constitute the basis for measuring similarity between email instances.

• Design a similarity measurement function yielding good results for the evaluation of emails' topic similarity.

• Implement the algorithm, the knowledge extraction framework and suitable data structures that allow persisting the clustering work data.

• Integrate the implemented module into "eM Client" and create a user interface that presents the clustering result and allows the user to execute the necessary operations of the clustering framework.

• Test the project with a test dataset and point out areas where optimization had to be added and where more improvement could be made in the future.
2.2 Existing approaches
Despite the urgency of the problem, a successful solution that would gain widespread acceptance has not been created yet. There are several partial solutions in use, but they either require continual feedback from the user, the extent of their helpfulness is rather limited, or they were not finished and made available for public use. Existing partial solutions include the following:

• Threaded view is a method that displays a conversation tree based on information about the relation to previous emails. This method works only for mails that were replied to or forwarded; a new email concerning the same topic from a user is not connected. This method is one of the oldest attempts to improve the experience of working with email, but it is very limited, especially at the present scale of the overload problem.

• Conversation view is an improved variation on the threaded view approach. It was introduced in Google's Gmail service. It improves the user experience by flattening the threaded view tree and by displaying each conversation as a single item in the basic folder view. The approach provides a clearer view of the user's emails, but still doesn't constitute a powerful tool that radically improves work with email.

• Rules - most modern desktop email clients have means to define rules consisting of a condition and an action to be performed if the condition is met. The condition is usually used to match a specific substring in the email subject or body, and the action then performs a move or copy operation. While rules are relatively popular among desktop client users, their filtering capabilities are limited, because the user has to define every single rule. The common use therefore is separating several basic types of email that enter the user's inbox, rather than separating every new topic that has to be dealt with.

• Automatic foldering is a more sophisticated approach based on filters matching the message with existing mail folders. Real-world implementations allow the user to define folders that automatically process incoming mails and filter messages similar to those already present in the folder. The user needs to add several example messages when the folder is created to provide some initial data to the folder filter, and is also allowed to add or remove messages from the folder later on, further improving its filtering performance. A popular implementation of this method is present in the browser/email client software "Opera". Although a promising approach, known implementations still require the user to define a folder for each topic he would like to separate. Algorithms used in these filters include, for example, Bayesian filtering, Support Vector Machines or k-NN. Attempts to integrate new folder recognition were also made, but haven't had any major influence. An overview of existing works in automatic foldering and their effectiveness can be found in [2].

Our approach might be viewed as a restatement of automatic foldering extended with new folder recognition, but approached from the clustering branch of data mining. The most notable work with an approach similar to this thesis is Gabor Cselle's master's thesis Organizing Email [3], which has notably influenced the author's research. Cselle performed extensive research on email organization and, on the basis of that work,
created a clustering framework and an extension to the email client "Thunderbird" that performed email clustering. However, the extension itself was never published.
2.3 Thesis structure
This document is structured into several main chapters reflecting the progress of the author's work:

• Chapter 3 describes the project components, the clustering theory used in our implementation and also the software "eM Client" that will integrate the clustering functionality.

• Chapter 4 provides an in-depth description of the design of the clustering algorithm, the text analysis algorithm, the similarity function, the utilized email message features, the realization of the database backend and the design of the user interface.

• Chapter 5 explains the architecture of the project, describes the classes that were implemented, the way clustering is integrated with the email client and the background of the user interface.

• Chapter 6 describes how the final product was tested, what steps were taken to improve performance, and enumerates existing issues.
Chapter 3

Analysis

This chapter will describe the logical components employed in the clustering process, basic data mining concepts and the target environment that the project will be integrated with.
3.1 Project components
The whole mail clustering process has many steps that can be divided into the following blocks. The first component is a clustering framework containing the clustering algorithm, clusters, instances, a data source for instances and other necessary infrastructure. Tightly connected to it are a distance function utilized by the clustering algorithm and a text analysis component that evaluates textual similarity of the mail content. In addition, there is a database persistence framework that needs to be provided to the clustering framework in the most transparent way possible. Finally, a component integrating the clustering with the email client and a user interface displaying the results from the clustering framework need to be connected to the above components.
3.1.1 Clustering framework
The core part of the project is the clustering algorithm itself. On input, this algorithm receives data instances from a data source. These items are first pre-processed by the distance function for information extraction. Afterwards, instances are processed by the distance function to determine their similarity to existing instances. Which instance pairs are processed by the distance function depends on the particular algorithm. Based on the calculated distance values, the clustering process is executed. During this process, clusters are formed or modified. Once the algorithm meets its stop criteria, the new mail instances have been clustered and the updated cluster information may be loaded by other components.
3.1.2 Distance function and text analysis
The distance function provides the clustering algorithm with information about the similarity of individual instances. Every new instance is passed to the distance function by
the clustering algorithm for processing. The distance function further passes the new instance to the text analysis subcomponent. When the clustering algorithm calls the distance function, the similarity of selected mail item features is evaluated and a mail body text similarity is calculated. These several scores are then merged into a distance value that is returned to the clustering algorithm. The text analysis subcomponent processes the mail body text of new instances and updates its internal data structures that store the relevance of individual words. When the distance function is asked to calculate similarity, the text analysis performs a similarity comparison of the two mail instances' texts and calculates a text similarity score using known text clustering formulas.
3.1.3 Data structures
Specialized data structures are necessary to store all data used in the whole clustering process. Besides simple lists used to store existing clusters and data instances, dictionaries are used for pre-calculated text analysis data, and two-dimensional dictionaries are used to store distance values between individual instances and clusters. Although lists and one-dimensional dictionaries exist in the target environment for in-memory operation, they had to be implemented as layers over a relational database used to persist the data. The two-dimensional dictionary had to be implemented for both in-memory and database operation. As these data structures are heavily utilized during the clustering process, they, as well as the underlying database schema, need to be highly optimized.
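The following is a minimal in-memory sketch of such a two-dimensional (two-key) dictionary; the class and member names are illustrative, not the project's actual types, and the database-backed variant is omitted:

    using System.Collections.Generic;

    public class TwoKeyDictionary<TKey, TValue>
    {
        // outer key -> (inner key -> value)
        private readonly Dictionary<TKey, Dictionary<TKey, TValue>> rows =
            new Dictionary<TKey, Dictionary<TKey, TValue>>();

        public void Set(TKey a, TKey b, TValue value)
        {
            Dictionary<TKey, TValue> row;
            if (!rows.TryGetValue(a, out row))
            {
                row = new Dictionary<TKey, TValue>();
                rows[a] = row;
            }
            row[b] = value;
        }

        public bool TryGet(TKey a, TKey b, out TValue value)
        {
            Dictionary<TKey, TValue> row;
            if (rows.TryGetValue(a, out row))
                return row.TryGetValue(b, out value);
            value = default(TValue);
            return false;
        }
    }

This mirrors how the instance and cluster distance matrices described later (section 4.1) are accessed: two identifiers in, one distance value out.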
3.1.4 Interoperability code
To integrate the previous components with the email client, an interoperability layer needs to exist. This layer governs the initialization of the clustering framework and its database backend and provides a custom data source transforming email messages from the client into the data instances that the clustering process works with. The data source also needs to monitor additions and removals of email messages and forward these events to the clustering framework. There are several other functional roles, described in the implementation chapter, that the layer needs to perform.
3.1.5 User interface
The email client must have a user interface that provides access to the results of the clustering process. The interface must have a non-intrusive character; that is, it must let the user use the client as usual, but easily allow them to take advantage of the clustering output. The interface will be developed using existing concepts present in the client. In its most minimal form, the interface will provide a view of existing clusters, display information about the selected cluster, provide a method to quickly view only messages from the selected cluster and allow the user to see which cluster a mail belongs to. The user interface will interact with the clustering framework primarily through the interoperability code.
3.2 Clustering
Clustering is a branch of the data mining discipline which aims at grouping similar data instances together into sets called clusters. To be able to devise a suitable clustering process, we need to review the relevant existing knowledge. Our view of the techniques must also reflect the specifics of the data and the environment we are about to embed the technology in. Clustering of email messages can be understood as a special case of document clustering with a lot of additional metadata taking part in the clustering process. There are also other specifics, such as the online nature of the clustering, the fact that the target result is a large number of relatively small clusters with a need to continually detect new ones, and the need for unsupervised operation of the algorithm. In the following sections we will describe algorithms, data representation, distance functions and knowledge extraction principles that relate to our problem. For each link in the clustering chain, we will also analyze what features and capabilities it needs to have in our implementation and why.
3.2.1 Algorithms
There are already many algorithms for clustering and new ones are continually being devised. This is mainly because each field has specific features that customized algorithms can take advantage of. For our clustering task we have researched several popular and relatively versatile clustering algorithms, as they are known to perform reasonably well on document clustering tasks and some of them have advantageous features that can address some of our needs. We will describe those algorithms, including their advantages and disadvantages, and state whether we have chosen the algorithm for our project.

3.2.1.1 K-means
K-means is one of the most essential clustering algorithms. It dates back to 1956 and has been gaining popularity ever since. It isn't complicated to implement and performs the clustering relatively fast - its computational complexity is linear. To use K-means, it is necessary to specify the target number of clusters - let k be the desired number of clusters. The algorithm creates k data points representing cluster centers, called centroids. An iterative partitioning loop described below is then executed to carry out the clustering. The formal definition of K-means [4] is to partition the data instances into k clusters so as to minimize a sum-of-squared-error objective function that is defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} |c_i - p_j|^2

where E is the squared-error value, k is the desired number of clusters, n is the number of a cluster's data instances, c_i is the centroid of the i-th cluster and p_j is the j-th data instance. The subtraction of instances represents the distance between the two instances. The algorithm below is a common approach to accomplishing the described function minimization. The approached minimum is, however, local, which is why the initial centroid distribution is so important.
The data points selected as cluster centers may be generated randomly, or, better, some form of heuristic may be used to select centroids that are likely to yield a good clustering result. A popular method is to gradually pick centroids from the clustered data instances so that their distance from the nearest other centroid is maximal. The clustering in K-means is performed iteratively by re-calculating centroids and re-associating data instances with the nearest centroid until stability is reached and no data instances change their parent cluster. The K-means algorithm in pseudo-code is written below:

    select_k_initial_centroids();
    do {
        associate_data_instances_with_nearest_centroid();
        // if some data instance changes its parent cluster,
        // the clustering_change variable is set to true
        calculate_new_centroid_positions();
        // the new mean of each cluster's data instances is calculated
        // and set as the cluster's new centroid
    } while (clustering_change);
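A minimal sketch of the farthest-first initialization heuristic mentioned above, under the assumption that a pairwise distance delegate is available; the class and method names are illustrative:

    using System;
    using System.Collections.Generic;

    class CentroidInitExample
    {
        static List<T> SelectInitialCentroids<T>(IList<T> data, int k,
                                                 Func<T, T, double> dist)
        {
            List<T> centroids = new List<T>();
            centroids.Add(data[0]);                     // seed with an arbitrary instance
            while (centroids.Count < k)
            {
                T best = data[0];
                double bestScore = -1.0;
                foreach (T p in data)
                {
                    double nearest = double.MaxValue;   // distance to nearest chosen centroid
                    foreach (T c in centroids)
                        nearest = Math.Min(nearest, dist(p, c));
                    if (nearest > bestScore) { bestScore = nearest; best = p; }
                }
                centroids.Add(best);                    // pick the farthest instance
            }
            return centroids;
        }
    }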
Figure 3.1: An example of the K-means algorithm on random two-dimensional data

Although the standard definition of K-means doesn't deal with incremental clustering, it is possible to modify the implementation to support it, and some research on this topic and on the issues arising from the time-dependent nature of the task has already been performed [5]. However, K-means has two major disadvantages. The first is the need to know the number of clusters in advance, which would require additional estimation algorithms or substantial algorithm modifications. The second issue arises from the concept of the centroid, which assumes that
a mean of data instances can be calculated. For more complicated data instances that are not just simple vectors (see subsection 3.2.2), this condition cannot be met. It can be overcome by avoiding the centroids completely and determining an instance's distance from a cluster as the average distance from the cluster's members. Although a viable approach, the computational complexity would rise enormously. So although K-means is a straightforward, versatile and fast algorithm, we have decided not to use it in its standard form as the clustering algorithm for our project.

3.2.1.2 Hierarchical clustering
The hierarchical clustering algorithm is, together with K-means, the most popular clustering technique. This method operates by building a hierarchy of clusters based on the similarity of the data instances they contain. The process of building the hierarchy may be terminated when a specified condition is met, resulting in clusters of the desired similarity and size. The metric used to measure similarity between instances can be chosen independently of the algorithm, but a second measure is used that defines how the distance between two clusters is calculated from the distances between instances in these clusters. There are three basic measures used:

• Minimum distance - the distance between two clusters is equal to the smallest distance between an instance from cluster A and an instance from cluster B.

• Maximum distance - the distance between two clusters is equal to the largest distance between an instance from cluster A and an instance from cluster B.

• Average distance - the distance between two clusters is equal to the average distance of instances from cluster A to instances from cluster B.

An algorithm that employs the minimum distance metric and also uses a minimal inter-cluster distance as the threshold to stop merging clusters is called a single-linkage algorithm. Once the measure of distance between clusters is defined, there are two approaches to the hierarchy's construction:

• Agglomerative - the algorithm starts with every data instance in a separate cluster and repetitively merges the clusters that are nearest to each other.

• Divisive - at the beginning, all instances are in one cluster, and a logic using the instance distance function repetitively splits existing clusters.

The agglomerative method is the more frequent choice when using the hierarchical algorithm and it also suits our needs better - it allows easier implementation of incremental clustering and provides better means to define the criteria that stop the clustering.
Figure 3.2: A dendrogram of clusters created by hierarchical clustering on the well-known Iris dataset

A pseudo-code showing how agglomerative hierarchical single-linkage clustering operates follows:

    create_cluster_for_each_data_instance();
    do {
        calculate_intercluster_distances();
        // the function also stores the minimal distance found
        // in the variable minimal_distance
        if (minimal_distance < merge_threshold)
            merge_nearest_clusters();
    } while (minimal_distance < merge_threshold);
Hierarchical clustering doesn't impose any requirements on the data instances or the distance function. This makes it more suitable for clustering complex data, as in our case. However, the algorithm does have two issues that must be noted. Once a cluster merge (or split) is performed, it cannot be undone. That is, if an instance I is merged with a cluster A and afterwards a cluster B containing instances very similar to instance I enters the clustering process, there is no way for instance I to be moved to cluster B. This might cause clustering performance degradation in incremental clustering. Another issue is the algorithm's computational complexity, which is quadratic. This is given by the need to calculate distances between all pairs of data instances. The complexity may cause problems if the algorithm is to be employed in our project.

3.2.1.3 Fuzzy C-means
Fuzzy c-means is a variation of the K-means algorithm with the addition of uncertainty. The previous algorithms understood an instance's belonging to a cluster as a boolean statement, and a single
instance could belong to only one cluster. Contrary to these models, fuzzy c-means allows one data instance to belong to several clusters. The extent to which an instance belongs to a cluster is denoted by a scalar value, the "degree of membership" [6]. The fuzzy nature of membership would be a very useful feature for our application, allowing the algorithm's uncertainty to be propagated to the user interface.

The algorithm operates similarly to K-means, but in each iteration, instead of re-associating data instances with the nearest clusters, a matrix containing the degree of membership for each cluster-instance combination is recalculated. The formal definition of the algorithm is still aimed at optimizing an objective function, but the function takes the degree of membership into account, so it is now defined as:

E = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^{f} \cdot |c_i - p_j|^2

where u_{ij} is the degree of membership of the j-th data instance to the i-th cluster, f is the fuzzifier described below, and the other variables have the same meaning as in the objective function of K-means. The formula for an instance's degree of membership to cluster i calculates the value based on the proximity of cluster i in comparison to the other clusters. The formula contains the additional parameter f, called the fuzzifier, which influences the gradient with which the membership decreases. If u_{ij} is the degree of membership of the j-th data instance to the i-th cluster, then

u_{ij} = \frac{1}{\sum_{m=1}^{k} \left( |c_i - p_j| / |c_m - p_j| \right)^{2/(f-1)}}
where k is the number of clusters, c_i is the centroid of the i-th cluster, p_j is the j-th data instance and f is the fuzzifier.

The variable degree of membership also has to be reflected in the centroid calculation. An instance with lesser membership to a cluster should have a smaller impact on the cluster's centroid position. This can be achieved by incorporating the u_{ij} value into the arithmetic average fraction, in both the denominator and the numerator.

Fuzzy c-means seems suitable for incorporating simple new-cluster detection: if an instance's maximum degree of membership to any cluster is under a specified threshold, a new cluster containing this instance may be created. Similarly to K-means, implementing incremental support in fuzzy c-means seems viable. While the author perceives this algorithm as promising and would like to focus on its potential in the future, the research necessary to correctly implement it for our application would need to be extensive and beyond the time frame available for this work; therefore it won't be used in our project.
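A minimal sketch of the membership formula above for one instance j, assuming distToCentroids[m] holds |c_m - p_j|; the class and method names are illustrative:

    using System;

    class FuzzyMembershipExample
    {
        // degree of membership of the instance to cluster i
        static double DegreeOfMembership(double[] distToCentroids, int i, double fuzzifier)
        {
            double sum = 0.0;
            for (int m = 0; m < distToCentroids.Length; m++)
                sum += Math.Pow(distToCentroids[i] / distToCentroids[m],
                                2.0 / (fuzzifier - 1.0));
            return 1.0 / sum;   // memberships of one instance across all clusters sum to 1
        }
    }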
3.2.1.4 Conclusion
We have described the three most promising candidates for the algorithm used in our project. All of them could be modified to suit our needs, but to ensure that the results yielded by the clustering framework are of decent quality from the beginning, we have decided to use Hierarchical Agglomerative Clustering in its single-linkage variation.
3.2.2 Data instances representation
The way data instances are represented in the clustering process is highly important and depends on many factors. Many steps of the process have specific requirements on the form of the input data; therefore data conversion and preprocessing is a step, often a very complex and lengthy one, that most data mining efforts start with. This is usually because the work takes place in existing data mining environments that have a predefined input format and only a limited ability to alter how the data is processed.

Given the fact that our clustering framework is being built from scratch, we will design its architecture in a very modular manner to allow as much independence of the data form as possible. Only the parts that need to understand the meaning of the data will have to be customized, such as the distance calculating component; the rest of the framework will work only through these custom implementations. Therefore, there will be no need to convert the data processed by the framework in our project, and possibly in any other use, should there be any. This is a correct approach from an engineering standpoint, but also a mandatory requirement dictated by the context of the project. The data processed by the clustering is already present in the email client software and it is not feasible to perform any large-scale conversion.

While we have described the emphasis that we put on keeping the data the way it is, certain transformations will have to take place to provide the information necessary for text analysis (see subsection 3.2.5). This transformation and data extraction will not, however, alter existing data, but will put the extracted knowledge in a distinct data store. The target data store will be designed to facilitate the performance and semantic needs of the text analysis component.

The data our project will work with are email messages. Emails have a rather complicated structure, but it can be condensed to two main parts that interest us:

• A decent set of metadata attributes. Many of these attributes will become instance features, meaning they will be analyzed during the clustering process, but they don't need to be transformed or stored in an additional location.

• Content - each email has one or more content parts. One of the parts will be selected and taken as a text document. This document part will be subject to processing using document clustering techniques that require further data transformation and storage.
Document clustering is usually modeled by representing documents as items in a vector space. Each vector in the vector space represents one document. Terms (individual words) extracted from all documents are used as the vector's components, each term being one component. The value of each component in a vector is a score determining the relevance of the associated term in the document the current vector represents (see subsubsection 3.2.5.2). Having each document represented by a vector with one component for every existing term from the document set is highly inefficient. The vectors would be mostly empty, as each document contains only a small subset of all the terms in the set. Therefore, the actual structure of the data store containing document terms will be different. Each document will have a
list of term and value pairs for the terms contained in the document. Although this will slightly complicate the data structures and the database backend for persisting the information, it will save large amounts of space and have a positive effect on performance. As has been explained, there is no need to store the data instances in a specific form for purposes of the clustering. The framework should process instances independently of their form through customized processing components, and the text analysis component will only store the necessary calculated data, but won't alter the data instances or duplicate the data they contain.
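A minimal sketch of this sparse representation: each document keeps only (term, value) pairs for the terms it actually contains, and a missing term is treated as zero. The class name is illustrative:

    using System.Collections.Generic;

    class SparseDocument
    {
        // term -> relevance-related value (e.g. term frequency)
        public readonly Dictionary<string, double> TermValues =
            new Dictionary<string, double>();

        public double GetValue(string term)
        {
            double value;
            return TermValues.TryGetValue(term, out value) ? value : 0.0;
        }
    }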
3.2.3 Measuring distance
The algorithm performs clustering based on instance distances. The quality of the clustering therefore tightly depends on the quality of the distance calculation. Most clustering algorithms operate with a pair-wise distance function that takes two instances as input and outputs a distance value. The distance function may range from a simple formula to an algorithm combining several other distance functions, which may perform analytical computations.

The methods of distance calculation depend on the types of data features that are to be reflected in the calculation. When instances are represented by vectors that contain only numerical values, simple metrics such as the Euclidean distance, the Manhattan distance or even a linear combination of the values may be used. One feature type our project will make use of is a set of textual strings. For this type, an intersection of two sets may be performed and the intersecting set's size then incorporated into a composite distance function. Another feature type may be a time stamp. Two time stamp features may be subtracted and the result normalized into a predefined interval - thus obtaining a valid scalar value that may be added into the composite distance.

Many data mining processes employ feature weighting to further improve the output quality. When feature weighting is used, the value of each feature is modified by a weight coefficient. Weight coefficients may be determined by many methods, such as training linear Support Vector Machines, the gradient descent technique, Odds Ratio or Information Gain training [7][8]. Document clustering also measures the distance of documents, but there the concept of weighting individual values is realized in the text analysis process by term relevance scoring.

To join the many distance values calculated by various methods, a grouping distance function must be formulated. Such a function can be viewed simply as a next level of distance function processing the numerical feature values, except that this time the values are already an extraction of the features' content. Therefore, the same formulas may be used. To reflect the essence of the grouping distance function - combining partial distance values into one - a weighted linear combination is a good choice.
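A minimal sketch of such a grouping function, anticipating the concrete formula of section 4.5; each per-feature similarity is assumed to be normalized to [0, 1], and the names and weights are illustrative:

    class CompositeDistanceExample
    {
        // featureSimilarities[m] in [0, 1]; weights[m] is the weight of feature m
        static double Distance(double[] featureSimilarities, double[] weights)
        {
            double max = 0.0, value = 0.0;
            for (int m = 0; m < featureSimilarities.Length; m++)
            {
                max += weights[m];                       // maximal achievable sum
                value += weights[m] * featureSimilarities[m];
            }
            return (max - value) / max;                  // invert and normalize to [0, 1]
        }
    }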
3.2.4 Email messages
An Internet Message (termed email) is a very versatile information container from which a lot of clustering-relevant data can be extracted. An email message consists of an envelope
and contents [9]. The envelope contains fields such as the sender, the recipients, the message's subject, a time stamp defining when the message was sent, or fields containing references to other email messages. These fields convey valuable information that has a substantial influence on recognizing the relation between emails. The contents of the message may either be a plain text or consist of MIME content parts [10]. Most of today's email communication content is in the format of MIME parts. These parts not only allow several sequential data blocks, such as the mail text and a binary attachment, but can form nested structures, allowing alternative forms of one content to be provided (e.g. a rich text format and a plain text format of the message), altered content to be encapsulated (e.g. encryption using the S/MIME standard), and have many other advantages. This flexibility slightly complicates the necessary processing logic in the mail client, because it has to select the best part to display to the user. It is obvious that the displayed part will also be the one most suitable for the document clustering analysis. When clustering email messages, both the envelope fields and the contents of the message must be taken into account. Therefore, several metrics for envelope fields must be combined with the metric of the document clustering component to get the final distance function, using combination as described in the previous section.
3.2.5 Document clustering
Document clustering aims at grouping similar documents based on analysis of their text. It is a field of text mining, which derives many concepts from information retrieval and statistics. Many approaches to determining similarity between texts exist, but the process usually has two parts: document processing and similarity calculation. The document processing takes place at the beginning of the text clustering process and can be divided into several steps:

• Decompose the text into single tokens - in most cases words. Apply preprocessing to the words, such as stemming, case conversion or stop word exclusion. Preprocessed words are then regarded as terms.

• Analyze the terms in the context of the document they were extracted from and calculate the values necessary to determine the term's relevance scoring later. The relevance scoring isn't usually calculated during document processing, as its formula parameters change when other documents are clustered. Therefore, it is preferable to store intermediate values that relate only to this document.

• Update the global values relating to each term processed in the document's analysis.

When the clustering itself takes place, pairwise document similarities are calculated. Ideally, during this calculation the documents' texts aren't processed, as all the necessary information - the terms the documents contain, associated with the values necessary for term relevance calculation - is already known from the initial document analysis. The document similarity calculation iterates through the documents' terms, determines a relevance scoring for each term and uses the selected measure to calculate the similarity value from all the term scorings from both documents. In the following sections, we will describe the relevant parts of this process in more detail.
3.2.5.1 Tokenization
Tokenization is the process of splitting a string of characters into tokens based on a predefined lexical grammar. In the case of document clustering, a simple grammar defining one token as a run of letters separated by non-letter characters may be sufficient.
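A minimal sketch of such a tokenizer, treating a token as an uninterrupted run of letters and lower-casing it; stemming and stop word exclusion are omitted, and the class name is illustrative:

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class TokenizerExample
    {
        static IEnumerable<string> Tokenize(string text)
        {
            // \p{L}+ matches one or more letters; everything else separates tokens
            foreach (Match m in Regex.Matches(text, @"\p{L}+"))
                yield return m.Value.ToLowerInvariant();
        }
    }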
3.2.5.2 Term ranking and TF-IDF
It has been described that each term needs to have a relevance scoring that is used during similarity calculation. The following reasons require such a scoring to exist, and the formula used to calculate the score has to address them:

• Documents vary in length. Even if one document has more occurrences of a term than another document, it doesn't mean that it is more related to the term. The first document may just be several times longer, and it may even deal with a completely unrelated topic. Therefore the term's number of occurrences must be put in relation to the document's length.

• Individual terms highly vary in their importance. Some words are very frequent and may be found in a large portion of a document set, while others have a very specific meaning and are present in only a few documents. The fact that two documents share a specific, infrequent term thus has a much higher weight than if they shared a term that can be found in most of the other documents. Thus a notion of the global relevance of a term must be maintained and used appropriately in the similarity calculation.

Both of these issues are addressed in a popular term ranking measure named "TF-IDF" (Term Frequency - Inverse Document Frequency). It is the product of the term's frequency within a document and an inverse frequency of the term's presence in all documents. The TF-IDF value for term i in document j is defined by the following formula:

tfidf_{ij} = \frac{t_{ij}}{\sum_m t_{mj}} \cdot \log\frac{|D|}{m_i}

where t_{ij} is the number of occurrences of term i in document j, \sum_m t_{mj} is the total number of term occurrences in document j, |D| is the number of documents in the set and m_i is the number of documents where term i appears. As can be seen from the formula, the score increases when the term is frequent within the document, but decreases with the number of documents it is present in. This measure has been proven to provide satisfactory output and we have decided to use it in our clustering framework.
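A minimal sketch of this formula as a function; the counts are assumed to come from the analysis structures described later in section 4.4:

    using System;

    class TfIdfExample
    {
        static double TfIdf(int termOccurrences,     // t_ij
                            int termsInDocument,     // sum of t_mj over all terms m
                            int totalDocuments,      // |D|
                            int documentsWithTerm)   // m_i
        {
            double tf = (double)termOccurrences / termsInDocument;
            double idf = Math.Log((double)totalDocuments / documentsWithTerm);
            return tf * idf;
        }
    }

For example, a term occurring 3 times in a 100-term document and present in 10 of 1000 documents scores 0.03 · log(100) ≈ 0.14.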
3.2.5.3 Cosine measure
A custom metric is necessary to calculate the similarity of two documents based on term relevance scores. The metric must neutralize the number of terms the documents contain and it should be normalized into a defined interval. The Cosine Measure is a frequent measure of choice used
with the TF-IDF measure. For the similarity of documents j and k, computed by summing term scores, it has the following form:

cosinemeasure_{jk} = \frac{\sum_n t_{nj} \cdot t_{nk}}{\|j\| \cdot \|k\|}

where n iterates over the union of all terms from j and k, t_{nj} is the score of term n in document j and \|j\| is the magnitude of document j. The magnitude can be calculated as if j were a Euclidean vector: \|j\| = \sqrt{\sum_n t_{nj}^2}. When term n is not present in document j, the value t_{nj} shall be 0.
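A minimal sketch of the cosine measure over two sparse term-score dictionaries, with a missing term contributing 0 as stated above; the class name is illustrative:

    using System;
    using System.Collections.Generic;

    class CosineExample
    {
        static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
        {
            double dot = 0.0, magA = 0.0, magB = 0.0;
            foreach (KeyValuePair<string, double> pair in a)
            {
                magA += pair.Value * pair.Value;
                double other;
                if (b.TryGetValue(pair.Key, out other))
                    dot += pair.Value * other;   // only terms present in both contribute
            }
            foreach (double v in b.Values)
                magB += v * v;
            return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
        }
    }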
3.2.6 eM Client
The clustering framework will be integrated within the software "eM Client" [11]. It is a modern email client on whose development the author participates. It aims at being a full-featured collaboration product easily integrated with third-party server messaging and groupware solutions. Aside from email, the client supports calendaring, task lists and contact management. All of these items may be synchronized with a server through a variety of open protocols such as IMAP, POP3, SMTP, CalDAV or CardDAV.
The product's history is rather short, as the development began three years ago. The development team now has five members, although most of the software has been written by a team of three that expanded several months ago.
The client is built on the Microsoft .NET Framework, which allows fairly dynamic growth of the application thanks to the powerful features of the runtime environment and to the large base class library that provides solid building blocks for the software's expansion in almost any direction.
Figure 3.3: The main window of eM Client

eM Client doesn't support any extensibility through plugins yet. Therefore, any new functionality needs to be integrated directly into the project's source code. However, thanks to the well-designed architecture of the software, new component integration code needs to be added in only a few locations. The user interface also allows new controls to be easily incorporated into the existing layout and connected with other parts of the interface. The integration will not require any existing functionality to be modified. For the implementation discussed in this work, we have decided that the clustering framework will operate only on items from the Inbox folder of the mail client.
Chapter 4

Design

Based on the general principles and facts described in the previous chapter, we will outline the concrete form of the algorithms and methods used in the clustering framework. The ideas were formed partially before the implementation, but changed during coding based on arising issues, and even after the framework had been finished.
4.1 Hierarchical agglomerative clustering
From the several candidates for the clustering algorithm, we have chosen hierarchical clustering. Its principle can be implemented fairly easily, it allows incremental clustering, and the way it operates enables us to cache many of the intermediate values it produces. We have decided on the agglomerative variant of the algorithm. Each instance therefore starts with its own cluster, and individual clustering iterations merge the nearest clusters.

With regard to the purpose of the clustering, we must view the clustering process as continual and never-ending. It won't be performed again and again, but rather started once and then slowly proceeding as new data instances are added to the dataset. When the process is started for the first time, initial processing must be performed. All instances present in the dataset must be analyzed, the necessary values calculated, new clusters created and the distances between the clusters calculated. Afterwards, once all instances have been processed, clustering on the dataset must be performed. We call this process initial clustering.

The algorithm has several data structures that contain the information necessary to perform the clustering effectively. Some of these structures could be abandoned, but the performance of the algorithm would then fall below a usable threshold. The basic structures are:

• Instance distance matrix contains a matrix of distances between all instances being clustered. This matrix is updated every time a new instance is added, by inserting the values returned by the distance function. All subsequent operations using a distance value retrieve the calculated value from this matrix. The matrix uses a two-dimensional dictionary.
• Cluster distance matrix contains the distances of all existing clusters between each other. Modifications are performed during the clustering process when clusters are merged and when a cluster for a newly added instance is created. This matrix also uses a two-dimensional dictionary.

• Clusters list maintains all clusters that exist in the current state of the clustering process.

• Instances is the list of all instances in the process. Although most of the time this list contains the same items as the algorithm's data source, it allows for better clustering framework separation and resolves issues with newly added instances.

When an instance is processed by the algorithm, either during initial clustering or when being added to the dataset, three basic operations are performed:

• The instance is passed to the distance function for processing. The distance function performs the necessary analysis, such as term extraction and ranking (see subsection 3.2.5).

• The instance is added to the algorithm's data structures, and the distance function is invoked for pairs of existing instances and the new instance to add values to the instance distance matrix.

• A new cluster is created and added to the clusters list.

For similarity between clusters, the minimum distance measure has been chosen. The minimum distance measure for two clusters is defined as the minimum of all distances between instances from the two clusters. This approach is used when filling the cluster distance matrix. When clusters are merged, the new cluster's distance to a cluster A is the minimum of the distances of the old clusters to cluster A, and as only two clusters are ever merged at a time, this simplifies to choosing the smaller of two numbers (a sketch of this update follows below).

The clustering itself is realized within a loop that iterates as long as some two clusters are closer than a specified threshold. The value of this threshold is very important to the quality of the clustering. A single iteration of the loop consists of the following steps:

• Find the nearest two clusters. If their distance is above the threshold, end the loop.

• Create a new cluster containing the instances from the two clusters.

• Calculate its distances to the other clusters. Go to the first step.

Incremental clustering is supported by performing similar instance processing as during the initial clustering and then executing the clustering loop. If the instance lies close enough to an existing cluster, merging takes place; otherwise the instance's cluster continues to exist.

The client allows email messages to be permanently removed. This event is announced to the clustering algorithm and clean-up operations are performed. The instance is removed from the instances list and from the cluster that contains it. The cluster's distance from the other clusters must then be updated. This operation is computationally far more costly than a cluster merge, as the modified cluster's instances must be compared with instances from all other clusters.
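A minimal sketch of the single-linkage bookkeeping described above: after clusters a and b are merged, the new cluster's distance to every remaining cluster c is the smaller of the two old distances. The dictionary-of-dictionaries stands in for the cluster distance matrix; the identifiers and names are illustrative, not the project's actual classes:

    using System;
    using System.Collections.Generic;

    class MergeExample
    {
        // distances[x][y] = current distance between clusters x and y
        static void UpdateAfterMerge(Dictionary<int, Dictionary<int, double>> distances,
                                     int a, int b, int merged,
                                     IEnumerable<int> remainingClusters)
        {
            Dictionary<int, double> row = new Dictionary<int, double>();
            foreach (int c in remainingClusters)
                if (c != a && c != b)
                    row[c] = Math.Min(distances[a][c], distances[b][c]);
            distances.Remove(a);
            distances.Remove(b);
            distances[merged] = row;
        }
    }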
4.2 Email messages
To be able to work with emails and extract their content, the framework must be able to parse them. Although the basic functionality to do so is relatively easy to implement, a full-featured, convenient library that provides all the necessary functions is a long-term effort. Thankfully, when integrated, the framework can make use of the library used in eM Client to work with email messages. For testing purposes, this library might be used independently of eM Client.
4.3 Data instance features
The features used as a base for the distance calculation are of the same, if not higher, importance as the clustering algorithm itself. A large number of features can be created from email message properties. During analysis of what is important when deciding that items discuss the same topic, several obvious choices came up. We have also drawn from other research, mainly Cselle's work [3]. Each of the properties also has a specific method of comparison and calculation of the numeric distance value. We devised the calculation of each feature to express similarity on the interval from 0 to 1. The following features participate in the distance calculation:

• Most email clients add headers for unique message identification and, for items that are replies, a header containing the identification string of the message being replied to. The unique identification header is named Message-Id and the header in reply messages is In-Reply-To. The value of these headers has no specific meaning and is usually randomly generated. Its only use is as an operand in string comparison to determine whether messages are related. It is probable that related messages discuss the same topic. The numeric feature value from these properties is 1 if one mail is a reply to the other (or vice versa) and 0 if there is no relationship between the messages.

• Based on the same headers as above, we check whether two mails are replies to the same email. This would be a typical case when multiple users discuss a topic. When the mails share the In-Reply-To value, this feature is 1, otherwise it is 0.

• The sender also bears an informational value. It is stored in the From header of an email. When the senders of the compared mails are identical, this feature is 1.

• The recipient sets (present in the header field To) of the email messages are compared to determine how many recipients the emails share. The size of the intersecting set relative to the size of the union of both recipient sets is calculated.

• Each email also has an origination date in the Date header. Emails relating to the same topic will often be near each other on a time line. We have decided to normalize the time distance to an interval of two months. The feature value calculation computes the time difference in hours divided by the total hours in two months and performs normalization and inversion, so that a value of 1 represents items not differing in time and a value of 0 items that are further apart than two months.
• Subject similarity is a very important feature. The subjects of both messages are tokenized and the number of tokens present in both subjects is divided by the total number of tokens in both texts.

• The final, most important feature is the text similarity. Our distance function retrieves the text similarity from the text analysis component, which is described in the following section. The value is normalized on the interval 0 to 1.
The listed features are combined in a distance function as described in section 4.5; a sketch of two of these feature calculations follows.
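A minimal sketch of the recipient-overlap and date features described above; the class and method names are illustrative:

    using System;
    using System.Collections.Generic;

    class FeatureExamples
    {
        // size of the intersection of two recipient sets relative to their union
        static double RecipientOverlap(ICollection<string> a, ICollection<string> b)
        {
            HashSet<string> union = new HashSet<string>(a);
            union.UnionWith(b);
            HashSet<string> intersection = new HashSet<string>(a);
            intersection.IntersectWith(b);
            return union.Count == 0 ? 0.0 : (double)intersection.Count / union.Count;
        }

        // 1 = identical dates, 0 = two months or more apart
        static double DateSimilarity(DateTime x, DateTime y)
        {
            double windowHours = 2 * 30 * 24;   // two months, approximated as 60 days
            double diffHours = Math.Abs((x - y).TotalHours);
            return Math.Max(0.0, 1.0 - diffHours / windowHours);
        }
    }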
4.4 Text analysis
Performing a quality text analysis and a subsequent similarity calculation should bring the highest benefit to the information value of the distance function. The text analysis component's operation can be divided into tokenization, term ranking and similarity calculation, as described in subsection 3.2.5. Tokenization in our framework simply splits the text into words separated by whitespace. Regular expressions are used to achieve this, and although not strictly necessary, this will allow the tokenizer to be improved in the future. The regular expression used is "(\S+)\s". Before describing the other two steps, it is necessary to list the data structures that these steps use:

• Terms is a dictionary of all terms in all documents; each term has an associated value defining how many documents the term is present in. This structure is updated when a new instance is being processed by the distance function. The occurrence count value is used when calculating the IDF coefficient.

• CachedIDFs is a dictionary used to cache the Inverse Document Frequency values of terms, to avoid calculating them every time a document similarity is computed.

• Documents is also a dictionary; this one contains information about the Term Frequency of terms contained in a data instance - a document. Each document has an associated dictionary, TermTFs, of terms within that document. This nested dictionary stores a Term Frequency value for each term.
Term ranking iterates through all the terms returned from the tokenization and updates the necessary items in the structures. The Terms dictionary is checked for the presence of the term - if it is present, the occurrence count is incremented, otherwise the term is added with a value of 1. The cached IDF value for the term must be updated in CachedIDFs to reflect that the occurrence value has changed. The last update concerns Documents, where the new document entry must be added and every processed term inserted into the inner dictionary. As can be seen from the data structures used, the final TF-IDF score isn't stored anywhere. This is because a new document addition would require recalculating the score for every contained term in every document that contains the same term, which would be highly
inefficient. When a text similarity has to be calculated, both documents' entries are retrieved from the Documents dictionary. The term lists from both documents are enumerated in sequence. For each term, TF-IDF is calculated from the values in CachedIDFs and TermTFs, and the variables holding the values of the cosine measure numerator and denominator are updated with the TF-IDF value.
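A minimal sketch of the update pass described above over the Terms and Documents structures; the structure names follow this section, while the class shape and signatures are illustrative, and the CachedIDFs update is reduced to a comment:

    using System.Collections.Generic;

    class TermRankingExample
    {
        // term -> number of documents containing it (the Terms structure)
        readonly Dictionary<string, int> terms = new Dictionary<string, int>();
        // document id -> TermTFs (the Documents structure)
        readonly Dictionary<string, Dictionary<string, double>> documents =
            new Dictionary<string, Dictionary<string, double>>();

        public void AddDocument(string id, IList<string> tokens)
        {
            Dictionary<string, double> termTFs = new Dictionary<string, double>();
            foreach (string token in tokens)
            {
                double tf;
                termTFs.TryGetValue(token, out tf);
                termTFs[token] = tf + 1.0 / tokens.Count;   // term frequency share
            }
            foreach (string term in termTFs.Keys)
            {
                int count;
                terms.TryGetValue(term, out count);
                terms[term] = count + 1;   // one more document contains the term
                // the CachedIDFs entry for this term would be recalculated here
            }
            documents[id] = termTFs;
        }
    }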
4.5 Distance function
The distance function combines all the intermediate similarity values to form one resulting distance value. A linear combination of the intermediate values is used, and its coefficients are the feature weights. The value then needs to be inverted to become a distance. As each feature value was normalized to a known interval, we may calculate the maximal achievable value and subtract the sum of the current values. The final step is to normalize the resulting distance, which is achieved by dividing by the maximal achievable value. The complete formula has the following form:

distance = \frac{\sum_m m_{max} - \sum_m m_{val}}{\sum_m m_{max}}

where m enumerates over all features, m_{max} is the maximal value of feature m and m_{val} is the value of feature m for the current pair of instances.
4.6 Database backend
The data used by the components from this chapter needs to be stored in a persistent location. There are high demands on the performance of the persistent store, so a suitable database engine needs to be used. eM Client uses the relational database engine SQLite, which is very lightweight and provides good performance. For compatibility reasons and based on previous positive experience, we have chosen to use SQLite in the clustering framework as well. eM Client also employs a typical two-layer model for objects, with storage objects directly representing the database data and application objects operating above the storage objects with higher-level logic. The storage objects are further managed by repositories. We have closely mimicked this architecture when designing the persistence classes for this project. There are several types of application objects and collections for enumeration of these objects. The objects represent a cluster, a data instance, document information (containing the list of a document's terms) and a term. The collections are a list, a dictionary and a two-key dictionary. The detailed form of the model is described in section 5.3 and the database schema is attached in Appendix A.
4.7 User interface
The interface we will add to eM Client needs to fit in well with the other parts of the software. The client contains a panel on the right side of the main window, where the contact list, communication history and attachment history are placed. The sidebar is a non-intrusive, yet handy
location, and our functionality is similar to the other components on the panel - it makes the user's work with email easier. Each component in the sidebar is represented by a box control that needs to be implemented. The visual layout of the control should be separated into two areas - a list containing the existing clusters and an area containing details about each cluster. The clustering interface should actively react when the user selects an email and look up the cluster to which the email belongs. This way the user will always be presented with the clustering context of the email. When a cluster is selected, either from the list or by being looked up when an email is selected, it should be very easy to filter just the emails from that cluster. The interface should hide if the currently selected folder isn't being clustered.
4.8 Integration layer
To interconnect the clustering framework with eM Client, an integration layer needs to be created. For such relatively standalone functionality, singleton manager classes are used in eM Client. These classes are initialized at startup and autonomously perform the necessary operations, based on events from application classes, collections or other sources. The class will need to carry out the following tasks:

• Manage the background thread in which all clustering operations will be invoked. All events must be delegated to this thread.

• Act as a data source for the clustering algorithm. This consists of enumerating the items of the client's folder, creating matching data instance objects, monitoring additions to and removals from the clustered folder, and announcing these events in transformed form to the clustering framework.

• Execute initialization tasks when clustering is launched for the first time, and actions based on the user's choices.
Chapter 5
Implementation

Based on the design from the previous chapter, we have implemented the clustering framework and the integration code, and integrated the framework into the client. The platform used for the implementation is the Microsoft .NET Framework. To maintain compatibility with eM Client and its requirements, we use version 2.0 of the .NET Framework. The development environment was Microsoft Visual Studio 2008. When designing the object model of the project, we made heavy use of interfaces to ensure that individual components do not depend on a specific implementation and that any part of the process may easily be swapped for a different implementation of the interface. The project's architecture can be divided into several blocks, which we describe in the following sections. These sections are concerned only with the general architecture consisting of interfaces and with the classes used by the framework when integrated with eM Client. There are also many classes that were used only when working with the framework as a standalone application. These classes either implement the described interfaces in a different manner (e.g. to access email files from a disk folder) or provide other helper functionality. These classes are available on the attached CD.
5.1 Clustering
This part contains all infrastructure related to the clustering itself. The elementary item interfaces for clusters and data instances are ICluster and IInstance. The clustering algorithm is represented by IClusterer, which depends on a source of instances defined by IDataSource. The algorithm also needs a metric that calculates instance similarity; this metric is IDistance. The ICluster interface contains the property Identifier for textual cluster identification and Instances, which is a dictionary of instances paired with a membership degree. In the current implementation, where hierarchical clustering is used, the degree of membership is always 1. IInstance has only one string property, Identifier, which provides a string for textual identification. The generic instance has no other, more specific members. The clustering algorithm doesn't need to access any instance properties, since it lets IDistance do the evaluation, and it is expected that IDistance will be implemented to work with a concrete IInstance implementation.
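The relationships between these interfaces can be summarized in the following sketch; the interface and property names are taken from the text, while the method signatures are our assumptions:

using System.Collections.Generic;

public interface IInstance
{
    // Textual identification of the instance.
    string Identifier { get; }
}

public interface ICluster
{
    string Identifier { get; }
    // Instances paired with a membership degree (always 1 with hierarchical clustering).
    IDictionary<IInstance, double> Instances { get; }
}

public interface IDistance
{
    // Implementations work with a concrete IInstance type; the clusterer itself
    // never inspects instance properties.
    double GetDistance(IInstance a, IInstance b);
}

public interface IDataSource
{
    IEnumerable<IInstance> GetInstances(); // assumed member
}

public interface IClusterer
{
    void Cluster(IDataSource source, IDistance distance); // assumed signature
    IEnumerable<ICluster> Clusters { get; }                // assumed member
}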
Since we support online clustering throughout the framework, but do not necessarily require the data source to be of an online nature, the members specific to an online data source have been separated into the IDataSource descendant IOnlineDataSource. Our implementation of IClusterer is the HierarchicalClusterer class. In section 4.1 we described several data structures that this algorithm uses. To separate the implementation of the data structures - we used in-memory structures during testing, while database-backed structures are implemented in the final version - an interface IHierarchicalClustererContext is used by the hierarchical clusterer class to access its data.
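A possible shape of the online descendant; the event and delegate names are illustrative assumptions:

public delegate void InstanceEventHandler(IInstance instance);

public interface IOnlineDataSource : IDataSource
{
    // Raised when items appear in or disappear from the underlying source,
    // allowing the clusterer to update the clustering incrementally.
    event InstanceEventHandler InstanceAdded;
    event InstanceEventHandler InstanceRemoved;
}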
5.2 Mail clustering
To support clustering of emails, several of the above interfaces had to be implemented to perform specific operations. Several other classes and interfaces had to be added in order to structure the code correctly. The interface IMailInstance has been created as a descendant of IInstance to provide access to the email message an instance represents. The email is exposed through the member Mail. IDistance was implemented in MailDistance to perform the distance calculations as described earlier. However, since text clustering is a more complex process, MailDistance delegates all operations related to document clustering to MailTextAnalyzer. The class MailTextAnalyzer performs the processing of new mail documents and the text similarity calculation. This requires data structures that are filled during analysis and used afterwards for the calculations. The same approach as with the hierarchical clusterer has been used: an interface named IMailTextAnalyzerContext hiding the storage implementation from the analyzer was created.
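In code, these pieces fit together roughly as follows. MailItem stands in for eM Client's actual mail item type, the MailTextAnalyzer stub only indicates the assumed API, and the combination of features is omitted:

public class MailItem { /* stand-in for eM Client's mail item type */ }

public interface IMailInstance : IInstance
{
    MailItem Mail { get; } // the member named in the text
}

public class MailTextAnalyzer
{
    // Assumed API: cosine similarity over TF-IDF vectors, cf. the chapter 4 sketch.
    public double Similarity(string docA, string docB) { return 0.0; /* stub */ }
}

public class MailDistance : IDistance
{
    private MailTextAnalyzer analyzer;

    public MailDistance(MailTextAnalyzer analyzer)
    {
        this.analyzer = analyzer;
    }

    public double GetDistance(IInstance a, IInstance b)
    {
        // All document-clustering work is delegated to the analyzer; the text
        // similarity is then combined with the other mail-specific features
        // as in section 4.5 (the combination is omitted in this sketch).
        double textSimilarity = analyzer.Similarity(a.Identifier, b.Identifier);
        return 1.0 - textSimilarity;
    }
}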
Figure 5.1: Architecture of the core clustering framework and associated classes for mail clustering.
5.3 Storage
Storage classes provide persistence to the other parts of the framework. The storage classes operate over a relational database, but completely shield the rest of the project from having to manage the database. The persistence is exposed through the implementations MailHierarchicalClustererContext and MailTextAnalyzerContext of the context interfaces IHierarchicalClustererContext and IMailTextAnalyzerContext. These implementations in turn access the application objects described below. The database storage is separated into two levels:

• The data level directly cooperates with the relational database; it executes SQL queries for inserting, updating, removing or listing items. Each type of item, such as a cluster or a mail data instance, has its own interface inherited from IRepositoryItem. Data objects implementing this interface are then used for in-memory storage of the item's properties. Each type of item is managed by a central repository class inherited from the base class DbRepository. Note that the class DbRepository has been taken from eM Client's storage layer. The repository takes care of all database interaction and manages a cache of storage objects. A specialized abstract class, DbDoubleKeyRepository, was implemented to serve as a base for repositories that are encapsulated by the two-key dictionary application-level collection.

• The application level provides an abstraction layer over the database. Item application classes encapsulate the data storage items and trigger modification operations in the repository. Collection classes implement the standard platform operations for enumeration, addition and removal of items. The matrices used in IHierarchicalClustererContext are implemented as descendants of the class DoubleKeyDictionary (a simplified sketch follows below).
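A minimal in-memory illustration of the two-key dictionary collection; the member names are assumed, and the project's real descendants replace the nested dictionaries with repository calls backed by DbDoubleKeyRepository:

using System.Collections.Generic;

public class DoubleKeyDictionary<TKey, TValue>
{
    private Dictionary<TKey, Dictionary<TKey, TValue>> data =
        new Dictionary<TKey, Dictionary<TKey, TValue>>();

    public TValue this[TKey key1, TKey key2]
    {
        get { return data[key1][key2]; }
        set
        {
            Dictionary<TKey, TValue> inner;
            if (!data.TryGetValue(key1, out inner))
            {
                inner = new Dictionary<TKey, TValue>();
                data[key1] = inner;
            }
            inner[key2] = value;
        }
    }

    public bool TryGetValue(TKey key1, TKey key2, out TValue value)
    {
        Dictionary<TKey, TValue> inner;
        if (data.TryGetValue(key1, out inner))
            return inner.TryGetValue(key2, out value);
        value = default(TValue);
        return false;
    }
}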
5.4 User interface
The user interface in the client has been implemented in correspondence with section 4.7. The class ControlSidebarBoxClusters is a descendant of a control that supports embedding in the sidebar. The inherited class contains a datagrid control that displays the list of clusters and a panel where information about the selected cluster is displayed. These two controls change their visibility when the folder selection changes and are visible only if the current folder is being clustered. The information in the detail panel is refreshed either when the user selects a cluster from the list or when a mail is selected in the main area of eM Client. The user interface also allows filtering messages by cluster. When the user double-clicks an item in the cluster list or clicks the "Filter cluster items" button in the detail panel, filtering is enabled in the messages area of the application and only messages from the current cluster are visible. Integration with other user interface components of the client is implemented in several methods of the application's main form - formMain.
Other interaction with the user, such as confirming actions of the clustering framework, is implemented through message boxes in the class that manages the clustering, which is described in the following section.
Figure 5.2: Fragment of eM Client's main form showing the ControlSidebarBoxClusters control.
5.5 Integration layer
To interconnect the clustering framework with the client, a class controlling the clustering and serving as a bridge for data in both directions needs to exist. This class is named ClusteringManager and it follows the Singleton design pattern. Upon eM Client startup, an initialization method of this class is executed. The initialization sets up all components of the clustering framework, including the database-backed contexts for the hierarchical clusterer and the text analysis. A background thread is also created during the initialization. All actions, such as the initial clustering or mail item addition and removal, are delegated to this thread and passed to the clusterer class. Lengthy clustering operations therefore do not block the execution of the threads that originated the events. The thread also has background priority to avoid slowing the program's operation down. ClusteringManager implements IOnlineDataSource to provide instances to the clusterer. Mail items in the clustered folder are enumerated and a corresponding object based on IMailInstance is created for each item. The folder's events are monitored and trigger the analogous events of the online data source interface. To allow retrieving the mails that correspond to data instances, another interface, IMailLoader, is used by the data instance objects. This interface is used to look up a mail item based on the identifier stored in the instance object whenever the instance object needs to access its email and does not already hold a reference to it. The last role of ClusteringManager is to expose the list of clusters from the framework. This is achieved by a property that returns the cluster list present in the hierarchical clusterer context interface.
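A compressed sketch of the manager's skeleton; all member names other than ClusteringManager are assumptions:

using System.Collections.Generic;
using System.Threading;

public class ClusteringManager
{
    private static ClusteringManager instance;
    public static ClusteringManager Instance
    {
        get
        {
            if (instance == null)
                instance = new ClusteringManager();
            return instance;
        }
    }

    private Thread workerThread;
    private Queue<ThreadStart> pendingActions = new Queue<ThreadStart>();

    private ClusteringManager()
    {
        workerThread = new Thread(ProcessActions);
        workerThread.IsBackground = true;
        workerThread.Priority = ThreadPriority.Lowest; // avoid slowing the client down
        workerThread.Start();
    }

    // Events from application threads are transformed into actions and
    // executed one by one on the worker thread.
    public void EnqueueAction(ThreadStart action)
    {
        lock (pendingActions)
        {
            pendingActions.Enqueue(action);
            Monitor.Pulse(pendingActions);
        }
    }

    private void ProcessActions()
    {
        while (true)
        {
            ThreadStart action;
            lock (pendingActions)
            {
                while (pendingActions.Count == 0)
                    Monitor.Wait(pendingActions);
                action = pendingActions.Dequeue();
            }
            action(); // e.g. initial clustering, instance addition or removal
        }
    }
}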
Chapter 6
Testing

During and after the project's implementation, we have tested the functionality on a dataset representing a portion of the author's inbox. The performance of the clustering has been analyzed with regard to memory and time consumption. Quality has been analyzed in two ways: by subjective evaluation and using the cophenetic correlation coefficient.
6.1 Clustering quality

6.1.1 Subjective evaluation
First, a subjective evaluation of the clusters' quality was carried out - judging whether they correspond to the real topics of the processed emails and whether the aggregation eases orientation in the sample dataset. The result of this review was positive - the emails were grouped into convenient clusters that distinguished the individual topics. To illustrate this, we have extracted a dendrogram of the sample dataset:
Figure 6.1: Dendrogram of the sample dataset. The red line denotes the maximal distance at which clusters are merged; the colored boxes encapsulate the multi-item clusters presented to the user.
As can be seen, 8 clusters containing more than 1 email were created, and each represents a compact topic corresponding to how the author thinks of the contained emails. To let the reader judge for himself, the emails' subjects matched to the leaves of the dendrogram can be found in Appendix B. Emails that were not merged with other items are also represented as clusters in the framework, but will not be shown to the user in the future.
6.1.2 Objective evaluation
To perform an objective evaluation of the clustering, we had to select one of the known quality measures. Since the testing dataset wasn't manually processed to assign a target cluster to each data instance, we had to restrict our choice to internal quality measures. One such measure, aimed at hierarchical clustering, is the cophenetic correlation coefficient. This coefficient tells us how well the clustering represents the relations between instances given by their distances. Two sets of values are compared: the instance distance matrix and the distances at which individual clusters merge in the dendrogram. The calculation of the coefficient was done in MathWorks MATLAB, which contains the necessary functions. The coefficient's value lies in the interval from 0 to 1, where 1 indicates an absolute correlation of the clusters' distances with the instances' distances. We have performed several experiments on the same dataset as in the previous section. There were two elementary properties whose impact on the clustering we tested: the distance function formula (section 4.5) and the cluster distance metric of the hierarchical clustering (subsubsection 3.2.1.2). The function that calculates the final distance between instances may have many forms; our current implementation uses a weighted linear combination, but there are two other very popular choices - Euclidean distance and the cosine measure. Since the cosine measure requires the component values of instance vectors, which is not compatible with the way our clustering works, only the Euclidean distance remains. We have modified the distance function and ran the clustering with the standard and the alternative formula. The resulting distance matrices were extracted and used in the testing described below. The second property we modified is the cluster distance metric, which has a major impact on the shape of the clusters. The available possibilities are single linkage (which is used in our implementation), average linkage and complete linkage. Using the extracted instance distance matrices for each distance function and the different cluster metrics, we performed a series of calculations of the cophenetic correlation coefficient in MATLAB and compiled the following comparison table:
                 Single    Average    Complete
Euclidean        0.965     0.974      0.958
Linear comb.     0.960     0.963      0.934
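For reference, the cophenetic correlation coefficient c compares the original instance distances d(i, j) with the cophenetic distances t(i, j), i.e. the heights at which instances i and j are first merged in the dendrogram (this is the quantity computed by MATLAB's cophenet function):

\[ c = \frac{\sum_{i<j} \bigl(d(i,j) - \bar{d}\bigr)\bigl(t(i,j) - \bar{t}\bigr)}{\sqrt{\sum_{i<j} \bigl(d(i,j) - \bar{d}\bigr)^{2} \sum_{i<j} \bigl(t(i,j) - \bar{t}\bigr)^{2}}} \]

where \bar{d} and \bar{t} denote the averages of the two sets of distances.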
The data show that the Euclidean instance distance measurement performs slightly better than the linear combination, which had been implemented first. We have therefore modified the framework to use the Euclidean distance by default. The cluster metrics indicate that average and single linkage perform better for our clustering task; single linkage was kept in use for now. The differences between the values are quite subtle, and a further evaluation on several larger datasets will therefore be needed to obtain more reliable results.
6.2 Data structures optimization
The data structures used in the context interfaces (see section 5.3) are linked with a database, which means that every operation such as a modification, enumeration or lookup performs a SQL query. While this is not a problem for lists such as Clusters or Instances, where modifications and enumerations are performed at a sustainable rate, it becomes an issue for the two-dimensional dictionaries used to store the instance distance and cluster distance matrices, which are heavily utilized during the clustering process (their utilization corresponds to the algorithm's O(n²) complexity). We have attempted to improve the situation, at least for read operations, by implementing a cache. An inner, in-memory two-dimensional dictionary has been added, and once a value for a key pair is modified or retrieved, it is stored in the cached dictionary. All subsequent accesses are served from the cache. Another substantial optimization, to be implemented in the future, is taking advantage of the symmetric nature of the distance function. This would allow the size of the distance matrices to be reduced by more than half.
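A sketch of that read-through cache, reusing the DoubleKeyDictionary illustration from section 5.3; the names are assumed:

// Read-through/write-through cache in front of the database-backed matrix.
public class CachedMatrix
{
    private DoubleKeyDictionary<string, double> backing; // performs SQL queries
    private DoubleKeyDictionary<string, double> cache =
        new DoubleKeyDictionary<string, double>();       // in-memory copy

    public CachedMatrix(DoubleKeyDictionary<string, double> backing)
    {
        this.backing = backing;
    }

    public double this[string key1, string key2]
    {
        get
        {
            double value;
            if (cache.TryGetValue(key1, key2, out value))
                return value;               // repeated reads stay in memory
            value = backing[key1, key2];    // single SQL round trip on first access
            cache[key1, key2] = value;
            return value;
        }
        set
        {
            backing[key1, key2] = value;    // write through to the database
            cache[key1, key2] = value;      // keep the cache coherent
        }
    }
}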
6.3 Performance
The hierarchical clusterer's greatest disadvantage is its high resource consumption. We have described an optimization that reduces disk access and, as a result, dramatically improves the speed of the clustering; measurements have shown an improvement of 30 %. To get a notion of the memory consumption of the whole clustering framework, we have used memory allocation monitoring on two versions of eM Client - with and without the clustering implemented. This is not a byte-precise approach, due to the nature of the .NET Framework, but it is sufficient for our needs. We have executed the initial clustering in the first version on a dataset with 70 items. The resulting difference was close to 6 megabytes of allocated memory. The overall speed and memory performance was not found to be up to the author's expectations and will be subject to further improvements. Fortunately, there is large space for enhancement, so getting the resource consumption to production-grade standards will be achievable.
Chapter 7
Conclusion

The author has decided to realize an email clustering framework as an approach to evolving the three-decades-old model of working with email. The motivation comes from the necessity to cope with the contemporary problem of email overload. This approach is very different from the well-known takes on the issue. It does not require the user's interaction, yet it is able to continually provide an alternative organization of the inbox that is likely to correspond to how the user thinks about the clustered emails. It is also non-intrusive and therefore has a high chance of successful adoption by users. In this work we went through the relevant concepts, algorithms and formulas that power the clustering framework. After choosing suitable techniques, we described their design and the individual features they exhibit. Afterwards, the implementation of the whole framework was described, along with the reasoning behind the decisions that were made. The implementation was realized with the aim of allowing integration into an existing email client and with an emphasis on the possibility of future expansion and improvement of the functionality. Furthermore, a convenient user interface that suits the target software's environment has been devised. Finally, testing and performance evaluation yielded relevant information, which led to several optimizations and will serve as a guide in future work on improving performance. We can conclude that we have succeeded in realizing the goals laid out at the beginning of our work.
7.1 Future work
Although a fair amount of effort has gone into the project so far, much more is needed to get it to the state where the author would like to see it - a mature component performing advanced, high-quality clustering and offering many features that further enhance this new element of work with email. Integration with other clients is also a long-term goal. The main tasks for the future evolution of this project are:

• Optimize operations with persistent data structures: batch writing using transaction scopes, intelligent caching making use of weak references, and saving space for symmetric matrices.
• Improve the clustering speed, optimize the algorithm as much as possible, and add support for parallelization.

• Perform large-scale experiments with model parameters using automated procedures based on optimizing quality measures, both internal and external [12].

• Investigate possibilities of automatic model parameter adjustment, parameter profiles and user-based parameter adjustments.

• Allow user-driven cluster modification - merging clusters, separating clusters, moving items between clusters.

• Investigate extending the user's perception of a cluster as a category, and a possible interconnection of rules with clusters.

• Work on the Fuzzy C-Means algorithm: compare it with the current algorithm, evaluate the advantages of fuzzy membership from both the user and the programmatic standpoint, add incremental support, explore the effect of external modifications of the clustering on the model, and more.
Bibliography

[1] New York Times - Is Information Overload a $650 Billion Drag on the Economy? http://bits.blogs.nytimes.com/2007/12/20/is-information-overload-a-650-billion-drag-on-the-economy/.

[2] Ron Bekkerman, Andrew McCallum, and Gary Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora.

[3] Gabor Cselle. Organizing email. Master's thesis, ETH Zurich, 2006.

[4] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2nd edition, 2006.

[5] Eduardo J. Spinosa, André Ponce de Leon F. de Carvalho, and Joao Gama. An online learning technique for coping with novelty detection and concept drift in data streams.

[6] Matteo Matteucci. Clustering - Fuzzy C-Means Clustering. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/cmeans.html.

[7] D. S. Yeung and X. Z. Wang. Improving performance of similarity-based clustering by feature weight learning.

[8] Dunja Mladenić, Janez Brank, Marko Grobelnik, and Natasa Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models.

[9] P. Resnick. RFC 2822: Internet Message Format.

[10] N. Freed and N. Borenstein. RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One.

[11] eM Client. http://www.emclient.com.

[12] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques.
Appendix A
SQL Schema

ATTACH DATABASE 'clusters.dat' AS clusters;
PRAGMA clusters.auto_vacuum = full;
PRAGMA clusters.page_size = 4096;
CREATE TABLE clusters."Clusters" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "latestMailOriginationDate" TIMESTAMP,
    "color" INTEGER);
CREATE TABLE clusters."ClusterInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL
        CONSTRAINT fk_ClusterInstances_parentId
        REFERENCES "Clusters" ("id") ON DELETE CASCADE,
    "instanceId" INTEGER NOT NULL
        CONSTRAINT fk_ClusterInstances_instanceId
        REFERENCES "Instances" ("id") ON DELETE CASCADE,
    "membershipDegree" FLOAT);
CREATE INDEX clusters."idx_Parent" ON "ClusterInstances" ("parentId");
DETACH DATABASE clusters;

ATTACH DATABASE 'instances.dat' AS instances;
PRAGMA instances.auto_vacuum = full;
PRAGMA instances.page_size = 4096;
CREATE TABLE instances."MailInstances" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "identifier" TEXT,
    "mailIdentifier" TEXT);
DETACH DATABASE instances;

ATTACH DATABASE 'instance_distances.dat' AS instance_distances;
PRAGMA instance_distances.auto_vacuum = full;
PRAGMA instance_distances.page_size = 4096;
CREATE TABLE instance_distances."InstanceDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX instance_distances."idx_Keys" ON "InstanceDistanceMatrix" ("key1", "key2");
DETACH DATABASE instance_distances;

ATTACH DATABASE 'cluster_distances.dat' AS cluster_distances;
PRAGMA cluster_distances.auto_vacuum = full;
PRAGMA cluster_distances.page_size = 4096;
CREATE TABLE cluster_distances."ClusterDistanceMatrix" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX cluster_distances."idx_Keys" ON "ClusterDistanceMatrix" ("key1", "key2");
DETACH DATABASE cluster_distances;

ATTACH DATABASE 'terms.dat' AS terms;
PRAGMA terms.auto_vacuum = full;
PRAGMA terms.page_size = 4096;
CREATE TABLE terms."Terms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "documentOccurrences" INTEGER);
CREATE INDEX terms."idx_Keys" ON "Terms" ("key1", "key2");
DETACH DATABASE terms;

ATTACH DATABASE 'document_infos.dat' AS document_infos;
PRAGMA document_infos.auto_vacuum = full;
PRAGMA document_infos.page_size = 4096;
CREATE TABLE document_infos."DocumentInfo" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "termsCount" INTEGER);
-- Note: the original listing repeated the ClusterInstances constraint here and
-- referenced "Clusters"; parentId logically references "DocumentInfo".
CREATE TABLE document_infos."DocumentTerms" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "parentId" INTEGER NOT NULL
        CONSTRAINT fk_DocumentTerms_parentId
        REFERENCES "DocumentInfo" ("id") ON DELETE CASCADE,
    "term" TEXT,
    "tf" FLOAT);
CREATE INDEX document_infos."idx_Parent" ON "DocumentTerms" ("parentId");
DETACH DATABASE document_infos;

ATTACH DATABASE 'idfs.dat' AS idfs;
PRAGMA idfs.auto_vacuum = full;
PRAGMA idfs.page_size = 4096;
CREATE TABLE idfs."CachedIDFs" (
    "id" INTEGER NOT NULL PRIMARY KEY,
    "key1" TEXT,
    "key2" TEXT,
    "value" FLOAT);
CREATE INDEX idfs."idx_Keys" ON "CachedIDFs" ("key1", "key2");
DETACH DATABASE idfs;
Appendix B
Clustering Illustrations
Figure B.1: Dendrogram of the sample dataset with the subject of each email.
Appendix C
CD directory structure

The attached CD has the following directory structure:

\bin        - executable copy of eM Client with the clustering functionality (the executable is eM Client.exe)
\clustering - source code of the clustering framework (the project file is Clustering.csproj)
\text       - this text (the PDF is grafnl.pdf, the LaTeX source is grafnl1.tex)