A Software System for Topic Extraction and Document Classification

Davide Magatti and Fabio Stella
Department of Informatics, Systems and Communications
Università degli Studi di Milano-Bicocca
Milan, Italy
Email: {magatti, stella}@disco.unimib.it

Marco Faini
DocFlow Italia S.p.A.
Centro Direzionale Milanofiori, Strada 4 Palazzo Q8
20089 Rozzano, Italy
Email: [email protected]
Abstract A software system for topic extraction and automatic document classification is presented. Given a set of documents, the system automatically extracts the mentioned topics and assists the user in selecting their optimal number. The user-validated topics are exploited to build a model for multi-label document classification. While topic extraction is performed by using an optimized implementation of the Latent Dirichlet Allocation model, multi-label document classification is performed by using a specialized version of the Multi-Net Naive Bayes model. The performance of the proposed conceptual model is investigated by using 10,056 documents retrieved from the WEB through a set of queries formed by exploiting the Italian Google Directory. This dataset is used for topic extraction, while an independent dataset, consisting of 1,012 elements labeled by humans, is used to evaluate the performance of the Multi-Net Naive Bayes model. The results are satisfactory, with precision being consistently better than recall for the labels associated with the four most frequent topics.
1. Introduction The continuously increasing amount of text available on the WEB, news wires, forums and chat lines, business company intranets, personal computers, e-mails and elsewhere is overwhelming [1]. Information is switching from useful to troublesome. Indeed, it is becoming more and more evident that while the amount of data is rapidly increasing, our capability to process information remains constant. This trend strongly limits the extraction of valuable knowledge from text and thus drastically reduces the competitive advantage we can gain. Search engines have exacerbated the problem by dramatically increasing the amount of text available in a matter of a few keystrokes. While Wigner [2] examines the reasons why a great body of physics can be neatly explained with elementary mathematical formulas, the same does not apply to sciences that involve human beings. Indeed, in this case it is becoming increasingly evident that complex theories will never
share the elegance of physics. Therefore, when dealing with sciences that involve human beings, "... we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data", as discussed by Alon Halevy, Peter Norvig and Fernando Pereira in their recent paper titled "The Unreasonable Effectiveness of Data" [3]. The authors conclude their paper by stating "Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data", and by giving the following suggestion: "Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail." In this paper we share this point of view and follow this suggestion in designing and implementing a software prototype for topic extraction and multi-label document classification. The software prototype implements a conceptual model which processes a corpus of unstructured textual data to discover which topics are mentioned, and then exploits the extracted topics to learn a supervised model for multi-label document classification. The rest of the paper is organized as follows: Section 2 introduces Text Mining and gives basic elements concerning the Latent Dirichlet Allocation model and the Multi-Net Naive Bayes model. Section 3 describes the main components of the software prototype along with their functionalities. Section 4 is devoted to numerical experiments and to the evaluation of multi-label document classification performance. Finally, Section 5 presents conclusions and discusses further developments.
2. Text Mining Text Mining (TM) [1], [4] is an emerging research area which aims to solve the problem of information overload, i.e. to automatically extract knowledge from semi-structured and unstructured text. Therefore, TM can be interpreted as the natural answer to the unreasonable effectiveness of data, i.e. as a way to efficiently exploit the huge amount of data available through the WEB. The main techniques used by TM are: data mining, machine learning, natural language processing, information retrieval and knowledge management. Typical tasks of TM are: text categorization, document clustering and organization, and information extraction.
2.1. Probabilistic Topic Extraction Probabilistic Topic Extraction (PTE) is a particular form of document clustering and organization used to analyze the content of documents and the meaning of words, with the aim of discovering the topics mentioned in a document collection. A variety of models have been proposed, described and analyzed in the specialized literature [5], [6], [7], [8]. These models differ in the assumptions they make concerning the data generating process. However, they all share the same rationale, i.e. a document is a mixture of topics. To describe how a PTE model works, let P(z) be the probability distribution over topics z and P(w|z) be the probability distribution over words w given topic z. The topic-word distribution P(w|z) assigns the weight given to thematically related words. A document is assumed to be generated as follows: the i-th word w_i is generated by first sampling a topic from the topic distribution P(z), and then choosing a word from P(w|z). We let P(z_i = j) denote the probability that the j-th topic was sampled for the i-th word token, and P(w_i | z_i = j) the probability of word w_i under topic j. Therefore, the PTE model induces the following distribution over words within a document:

P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) \, P(z_i = j)        (1)
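To make the generative assumption behind Eq. (1) concrete, the following minimal Python sketch samples a toy document word by word, first drawing a topic from P(z) and then a word from P(w|z); the array names and toy distributions are illustrative assumptions, not part of the described system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distributions: T = 2 topics over a vocabulary of W = 4 words.
p_z = np.array([0.6, 0.4])                      # P(z)
p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],   # P(w | z = 0)
                        [0.1, 0.1, 0.4, 0.4]])  # P(w | z = 1)

def generate_document(n_words):
    """Generate word indices by sampling z ~ P(z), then w ~ P(w|z)."""
    topics = rng.choice(len(p_z), size=n_words, p=p_z)
    return [rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z]) for z in topics]

# The marginal word distribution of Eq. (1): P(w) = sum_j P(w|z=j) P(z=j)
p_w = p_z @ p_w_given_z
print(generate_document(10), p_w)
```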
where T is the number of topics. Hofmann [6], [9] proposed the probabilistic Latent Semantic Indexing (pLSI) method, which makes no assumptions about how the mixture weights P(z_i = j) in (1) are generated. Blei et al. [7] improved the generalizability of this model to new documents by introducing a Dirichlet prior, with hyperparameter α, on P(z_i = j), thus originating the Latent Dirichlet Allocation (LDA) model. In 2004, Griffiths and Steyvers [10] introduced an extension of the original LDA model which associates a Dirichlet prior, with hyperparameter β, also with P(w_i | z_i = j). The authors suggested that this hyperparameter be interpreted as a prior observation count on the number of times words are sampled from a topic before any word from the corpus is observed. This choice smooths the word distribution in every topic, with the amount of smoothing determined by β. The authors showed that good choices for the hyperparameters α and β depend on the number of topics T and the vocabulary size W, and that, according to the results of their empirical investigation, α = 50/T and β = 0.01 work well with many different document collections. Topic extraction, i.e. estimation of the topic-word distributions and of the topic distributions for each document, can be implemented through different algorithms. Hofmann [6] used a direct estimation approach based on the Expectation-Maximization (EM) algorithm. However, such an approach suffers from problems involving local maxima of the likelihood function. A better alternative has been proposed by Blei et al. [7], who directly estimate the posterior distribution over z given the observed words w. However, many text collections contain millions of word tokens, and thus the estimation of the posterior over z requires the adoption of efficient procedures. Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC), is easy to implement and provides a relatively efficient method for extracting a set of topics from a large corpus. It is worthwhile to mention that topic extraction can be performed when the number of topics T is given. However, in many cases this quantity is unknown and we have to resort to empirical procedures and/or to statistical measures, such as perplexity, to choose the optimal number of topics to be retained after the unsupervised learning task has been performed.
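As an illustration of the inference step, the following is a minimal Python sketch of collapsed Gibbs sampling for LDA with symmetric priors α and β; the function name, data layout and toy values are assumptions made here for exposition and do not reproduce the optimized C++ implementation used by the software system.

```python
import numpy as np

def lda_gibbs(docs, T, W, alpha, beta, n_iter=500, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of documents, each a list of word ids in 0..W-1.
    Returns phi (T x W estimate of P(w|z)) and theta (D x T estimate of P(z|d))."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    nzw = np.zeros((T, W))            # topic-word counts
    ndz = np.zeros((D, T))            # document-topic counts
    nz = np.zeros(T)                  # tokens assigned to each topic
    z_assign = []
    for d, doc in enumerate(docs):    # random initial topic assignments
        zs = rng.integers(T, size=len(doc))
        z_assign.append(zs)
        for w, z in zip(doc, zs):
            nzw[z, w] += 1; ndz[d, z] += 1; nz[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]    # remove the current assignment
                nzw[z, w] -= 1; ndz[d, z] -= 1; nz[z] -= 1
                # full conditional P(z_i = j | z_-i, w), up to normalization
                p = (nzw[:, w] + beta) / (nz + W * beta) * (ndz[d] + alpha)
                z = rng.choice(T, p=p / p.sum())
                z_assign[d][i] = z
                nzw[z, w] += 1; ndz[d, z] += 1; nz[z] += 1
    phi = (nzw + beta) / (nz[:, None] + W * beta)
    theta = (ndz + alpha) / (ndz.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta

# Toy usage: 3 tiny documents over a vocabulary of W = 4 words, T = 2 topics.
docs = [[0, 0, 1, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3]]
phi, theta = lda_gibbs(docs, T=2, W=4, alpha=50 / 2, beta=0.01, n_iter=200)
print(phi.round(2), theta.round(2))
```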
2.2. Text Categorization Document classification, the task of classifying natural language documents into a predefined set of semantic categories, has become one of the key methods for organizing online information. It is commonly referred to as text categorization and represents a building block of several applications, such as web page categorization, newswire filtering and the automatic routing of incoming messages at call centers. Text categorization is distinguished into binary, multi-class and multi-label settings. In the binary setting there are exactly two classes, e.g. relevant and non-relevant, spam and non-spam, or sport and non-sport. Some classification tasks require more than two classes, e.g. an e-mail routing agent at a call center needs to forward an incoming message to the right operator depending on the specific nature of the message contents. Such cases belong to the multi-class setting, where documents are labeled with exactly one out of K classes. Finally, in the multi-label setting there is no one-to-one correspondence between class and document.
In such a setting, each document can belong to many categories, exactly one, or none at all. Several supervised learning models have been described in the specialized literature to cope with text categorization. However, the most studied models are Support Vector Machines and Naive Bayes. The Support Vector Machines (SVM) approach has been proposed by Joachims [11] and has been extensively investigated by many researchers [12], [13], [14], [15], to mention a few. The SVM model has been further extended to cope with cases where little training data is available. For example, a news-filtering service that requires thousands of labeled examples is very unlikely to please even the most patient user. To cope with such cases, the Transductive Support Vector Machine (TSVM) is presented and discussed in [16] and further developed in [17], [18], [19]. Several works have extensively studied the Naive Bayes model for text categorization [20], [21]. However, these pure Naive Bayes classifiers consider a document as a binary feature vector, and so they cannot exploit the term frequencies in a document, resulting in poor performance. The multinomial Naive Bayes text classifier has been shown to be an effective alternative to the basic Naive Bayes model by a number of researchers [22], [23], [24]. However, it has also given disappointing results compared to many other statistical learning methods, such as nearest neighbor classifiers [25], support vector machines [11] and boosting [26]. Recently, Sang-Bum Kim et al. [27] revisited the Naive Bayes framework and proposed a Poisson Naive Bayes model for text classification with a statistical feature weighting method. Feature weighting has many advantages compared to previous feature selection approaches, especially when new training examples are continuously provided. In this paper a Multi-Net Poisson Naive Bayes model is used to implement the multi-label document classification functionality.
3. The Software System The software system described in this paper consists of three main components, namely Text Pre-processor, Topic Extractor and Multi-label Classifier. These components, which are also available as stand-alone applications, have been integrated to deploy a software system running on the Windows XP and Vista operating systems. A brief description of the aims and functionalities offered by the three software components follows.

• Text Pre-processor. This software component implements functionalities devoted to document pre-processing and document corpus representation. It offers stopword removal, different word stemming options and several filters to exclude those words which are judged to be too frequent/rare within a document and/or across the document corpus. This software component exploits a general purpose Italian vocabulary to obtain the word-document matrix following the bag-of-words model. Furthermore, the following document representations are allowed: binary, i.e. whether or not a word token belongs to a given document; term frequency, i.e. how many times a word token is mentioned in a given document; and term frequency inverse document frequency, first introduced in [28]. The following document formats are valid inputs for the software system: pdf, word and txt. The user interacts with the Text Pre-Processing software component through the GUI depicted in Figure 1.

Figure 1. Text Pre-Processor GUI.

• Topic Extractor. This software component offers topic extraction and topic selection functionalities. The topic extraction functionality is implemented through a customized version of the Latent Dirichlet Allocation (LDA) model [10]. LDA learning, i.e. topic extraction, is obtained by using the Gibbs sampling algorithm, which has been implemented in the C++ programming language on a single-processor machine. The topic selection functionality assists the user in choosing the optimal number of topics to be retained. Topic selection is implemented through a hierarchical clustering procedure based on the symmetrized Kullback-Leibler distance between topic distributions [8]. Each retained topic z = j is summarized through the estimate of its prior probability P(z = j) and a sorted list of its most frequent words w, together with the estimate of their conditional probabilities of occurrence given the topic, i.e. the value of the conditional probability P(w|z = j).

• Multi-label Classifier. This software component implements a supervised multi-label classification model. The model exploits the output of the Topic Extractor software component, i.e. the conditional probability distribution of words given the topic, and uses a customized version of the Multi-Net Poisson Naive Bayes (MNPNB) model [27]. The MNPNB model allows the user to select the following bag-of-words representations: binary, term frequency and term frequency inverse document frequency. Each new document, represented according to the bag-of-words model, is automatically labeled depending on the user-specified value for the posterior threshold: if the posterior probability of a given topic is greater than the user-specified posterior threshold, then the document receives the label associated with the considered topic (a minimal sketch of this labeling rule is given below). This software component has been implemented in C# and is available through dedicated GUIs (Figure 2 and Figure 3) as well as a WEB service.
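The labeling rule just described can be sketched as follows: one binary Poisson Naive Bayes net per topic (the multi-net structure), with a label attached whenever the topic posterior exceeds the threshold. Everything in the sketch, including the way Poisson rates are derived from the topic-word distributions, the background model and the document length constant, is an illustrative assumption; the actual component is a customized C# implementation of the MNPNB model [27].

```python
import numpy as np

def poisson_loglik(x, lam):
    """Log-likelihood of word counts x under independent Poisson rates lam.
    The log(x!) term is omitted: it cancels when comparing hypotheses on the
    same document."""
    return float(np.sum(x * np.log(lam) - lam))

def label_document(x, topic_rates, background_rates, priors, threshold=0.5):
    """One binary Poisson Naive Bayes net per topic (multi-net structure):
    the document receives label j whenever P(topic j | x) > threshold."""
    labels = []
    for j, (lam_t, prior) in enumerate(zip(topic_rates, priors)):
        log_pos = np.log(prior) + poisson_loglik(x, lam_t)
        log_neg = np.log(1.0 - prior) + poisson_loglik(x, background_rates)
        m = max(log_pos, log_neg)
        posterior = np.exp(log_pos - m) / (np.exp(log_pos - m) + np.exp(log_neg - m))
        if posterior > threshold:
            labels.append(j)
    return labels

# Toy setup: 2 topics over a 4-word vocabulary. Rates are built from the
# topic-word distributions P(w|z) scaled by an assumed document length.
phi = np.array([[0.5, 0.3, 0.1, 0.1],     # P(w | z = 0)
                [0.1, 0.1, 0.4, 0.4]])    # P(w | z = 1)
priors = np.array([0.08, 0.07])           # P(z = j), as produced by topic extraction
doc_len, eps = 300.0, 1e-6
topic_rates = doc_len * phi + eps
background_rates = doc_len * phi.mean(axis=0) + eps   # crude "background" model

x = np.array([40, 25, 5, 3])              # word counts of a new document
print(label_document(x, topic_rates, background_rates, priors))   # -> [0]
```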
4. The Italian Google Directory The performance of the software system is investigated using a document corpus collected by exploiting the topic structure offered by the Italian Google Directory (gDir)1. This topic structure (Figure 4) relies on the Open Directory Project (DMOZ), which manages the largest human-edited directory available on the web; each branch of the directory can be extended only by editors having a deep knowledge of the specific topic to be edited. Furthermore, the editors' community guarantees the fairness and correctness of the gDir topic structure. The topics associated with the first level of gDir are the following: ACQUISTI, AFFARI, ARTE, CASA, COMPUTER, CONSULTAZIONE, GIOCHI, NOTIZIE, REGIONALE, SALUTE, SCIENZA, SOCIETÀ, SPORT, TEMPO LIBERO. Each first level topic is associated with a second and third level sub-topic structure summarized in a words list.

Figure 3. MNPNB classifier: results GUI.
4.1. Document Corpus The document corpus has been collected by submitting a set of 273 queries to the Google search engine. Each query contains a pair of words, randomly selected from the union of the words lists associated with the second and third level sub-topic structures.
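A minimal sketch of the random query construction is shown below; the word list content and the sampling scheme are assumptions made for illustration, since the paper only states that each query pairs two randomly selected words from the union of the second- and third-level word lists.

```python
import random

random.seed(42)

# Union of the words lists attached to the second- and third-level gDir
# sub-topics (toy content; the real lists come from the directory itself).
gdir_words = ["veicoli", "realtà", "virtuale", "body", "art", "culture",
              "multimedia", "politica", "turismo", "salute"]

# 273 queries, each pairing two distinct randomly selected words.
queries = [" ".join(random.sample(gdir_words, 2)) for _ in range(273)]
print(queries[:3])
```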
Some examples of the submitted random queries are the following: "veicoli realtà virtuale", "body art altre culture" and "multimedia politica". The PDF filter offered by the Google search engine has been used to ensure that only pdf format files are retrieved. The random query process retrieved 14,037 documents, each associated with one or more gDir first level topics.

1. http://www.google.com/Top/World/Italiano/.

Figure 2. MNPNB classifier: labeling GUI.

Figure 4. The Italian Google Directory.
4.2. Text Pre-processing The document corpus, consisting of the 14,037 retrieved documents, has been submitted to the Text Pre-processor software component. The PDF files have been converted to plain text and submitted to stopword removal and word stemming. Furthermore, size-based file selection has been applied to include only those PDF files with size between 2 and 400 KB. The resulting document corpus consists of 10,056 documents (D = 10,056). The global vocabulary, which has been formed by including only those words occurring in at least 10 and in no more than 450 documents, consists of 48,750 word tokens (W = 48,750). The document corpus is represented by a word-document matrix WD consisting of W × D term frequency elements (48,750 × 10,056). The word-document matrix is input to the Topic Extractor software component, as described in the next subsection.
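The document-frequency filtering step can be reproduced, for illustration, with scikit-learn as sketched below; the corpus directory name is hypothetical, and the Italian stop-word removal and stemming performed by the actual Text Pre-processor are assumed to have been applied beforehand.

```python
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

# Plain-text versions of the retrieved PDFs (PDF-to-text conversion, stemming
# and Italian stop-word removal are assumed to have been performed already).
texts = [p.read_text(encoding="utf-8") for p in Path("corpus_txt").glob("*.txt")]

# min_df / max_df given as integers are absolute document counts: keep only
# words occurring in at least 10 and in no more than 450 documents.
vectorizer = CountVectorizer(min_df=10, max_df=450)
wd = vectorizer.fit_transform(texts)   # sparse D x W term-frequency matrix
                                       # (the transpose of the W x D layout above)
print(wd.shape, len(vectorizer.vocabulary_))
```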
4.3. Topic Extraction The Topic Extractor software component has been invoked with the following learning parameters: 12 topics (T = 12); an alpha prior equal to 1.67 (α = 1.67), which implements the α = 20/T rule cited in [7]; and a beta prior equal to 0.004 (β = 0.004), which implements the β = 200/W rule cited in [7]. The Gibbs sampling procedure has been run 100 times with different initial conditions and different initialization seeds. Each run consisted of 500 sampling iterations. The topics extracted through the last 99 runs of the Gibbs sampling learning procedure have been re-ordered to correspond as closely as possible with the topics obtained through the first run; correspondence was measured by means of the symmetrized Kullback-Leibler distance. The 12 topics extracted by the Topic Extractor software component have been summarized through their corresponding 500 most frequent words. Among the extracted topics, the four most interesting ones have been manually labeled as follows:
• SALUTE (medicine and health),
• COMPUTER (information and communication technologies),
• TEMPO LIBERO (travels and holidays),
• AMMINISTRAZIONE (bureaucracy, public services).
The structure of the four topics is described in Table 1 and Table 2. Each topic is associated with an estimate of its prior probability:
• P(SALUTE) = 0.0787,
• P(COMPUTER) = 0.0696,
• P(TEMPO LIBERO) = 0.0884,
• P(AMMINISTRAZIONE) = 0.1021.
Furthermore, each topic is summarized in a words list, whose 20 most frequent words are reported in Table 1 and Table 2. It is worthwhile to mention that for each word wi and topic j, an estimate of the conditional probability of the word wi given topic j, i.e. P(wi|j) (Eq. 1), is provided.
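The re-ordering of topics across Gibbs runs described above can be illustrated with the following Python sketch, which matches the topics of a later run to those of the reference run by minimizing the symmetrized Kullback-Leibler distance between topic-word distributions; the use of the Hungarian algorithm for the matching is an assumption made here, since the paper does not specify the matching procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sym_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler distance between two word distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi_ref, phi_run):
    """Match topics of a later run to the reference run (T x W matrices of
    P(w|z)), minimizing the total symmetrized KL distance."""
    T = phi_ref.shape[0]
    cost = np.array([[sym_kl(phi_ref[i], phi_run[j]) for j in range(T)]
                     for i in range(T)])
    _, col_ind = linear_sum_assignment(cost)   # Hungarian algorithm
    return col_ind     # col_ind[i] is the run topic matched to reference topic i

# Example with random topic-word distributions (T = 3 topics, W = 5 words).
rng = np.random.default_rng(0)
phi_ref = rng.dirichlet(np.ones(5), size=3)
phi_run = phi_ref[np.array([2, 0, 1])]         # the same topics, shuffled
print(align_topics(phi_ref, phi_run))          # recovers the matching: [1 2 0]
```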
Table 1. SALUTE and COMPUTER.

SALUTE        0.0787    COMPUTER      0.0696
cellule       0.0032    blog          0.0063
emissioni     0.0029    google        0.0035
nutrizione    0.0028    linux         0.0034
molecolare    0.0026    copyright     0.0033
proteine      0.0022    wireless      0.0030
dieta         0.0022    source        0.0029
climatici     0.0021    access        0.0028
foreste       0.0021    client        0.0027
cancro        0.0021    multimedia    0.0027
aids          0.0020    hacker        0.0026
disturbi      0.0020    password      0.0026
infermiere    0.0019    giornalismo   0.0025
cibi          0.0019    browser       0.0023
tumori        0.0019    provider      0.0022
veterinaria   0.0018    telecom       0.0022
obesità       0.0018    brand         0.0022
clinico       0.0018    book          0.0021
serra         0.0017    chat          0.0021
virus         0.0017    wiki          0.0021
infezioni     0.0017    piattaforme   0.0021

Table 2. TEMPO LIBERO and AMMINISTRAZIONE.

TEMPO LIBERO  0.0884    AMMINISTRAZIONE  0.1021
sconto        0.0077    locazione        0.0021
aeroporto     0.0038    federale         0.0021
salone        0.0035    direttivo        0.0021
spiaggia      0.0028    finanze          0.0020
lago          0.0028    versamento       0.0020
colazione     0.0026    lire             0.0019
albergo       0.0025    commi            0.0019
vacanza       0.0025    prescrizioni     0.0018
piscina       0.0024    vietato          0.0018
vini          0.0023    contrattuale     0.0018
bagni         0.0023    richiedente      0.0018
voli          0.0021    utilizzatore     0.0017
pensione      0.0021    agevolazioni     0.0017
biglietto     0.0020    contabile        0.0017
notti         0.0020    appalto          0.0017
escursioni    0.0020    affidamento      0.0017
agevolazioni  0.0020    redditi          0.0017
archeologico  0.0019    sanzione         0.0017
piatti        0.0019    somme            0.0016
bicicletta    0.0019    indennità        0.0016

4.4. Multi-Label Classification The performance of the software system, as a whole, has been estimated by submitting a new document corpus to the Multi-label Classifier.
This document corpus has been collected by using the same random querying procedure described in subsection 4.1. Its documents have been manually labeled, according to the 12 first level gDir topics, independently by three humans. The labeled document corpus consists of 1,012 documents. It is worthwhile to mention that each document can be associated with one or more labels, i.e. a document can mention one or more topics. In detail, 478 documents are singly labeled, 457 are doubly labeled, while only 77 are associated with three labels. The Multi-label Classifier, queried by using the binary document representation and by setting a posterior threshold equal to 0.5, achieves an accuracy equal to 73%, which can be considered satisfactory. The estimates of precision and recall for the four selected topics, namely SALUTE, COMPUTER, TEMPO LIBERO and AMMINISTRAZIONE, are reported in Table 3. The best result is achieved for the topic AMMINISTRAZIONE, where the precision equals 92%, i.e. if the Multi-label Classifier labels a document with the label AMMINISTRAZIONE, then the labeling is wrong with respect to the manual labeling with probability 0.08. Furthermore, the recall equals 79%, which means that the documents manually labeled with AMMINISTRAZIONE are correctly labeled by the Multi-label Classifier with probability 0.79. The topic COMPUTER achieves a precision value equal to 85%, which is slightly lower than that achieved for AMMINISTRAZIONE. However, the achieved recall value drops from the 79% of AMMINISTRAZIONE to 59%. The topic TEMPO LIBERO achieves a precision value equal to 78%, i.e. slightly lower than that achieved for COMPUTER, while the achieved recall value is equal to 44%. Thus, the achieved recall value is significantly lower than that achieved for COMPUTER. Finally, the topic SALUTE achieves performances comparable to those of TEMPO LIBERO: the precision equals 76%, while the recall equals 41%. Around 57% of the documents manually labeled with TEMPO LIBERO and/or with SALUTE are not identified as such by the Multi-label Classifier. It is worthwhile to notice that for each label the achieved recall value is consistently lower than the precision value. A possible explanation for this behavior is as follows: the manual labeling procedure is both complex and ambiguous, and it may label documents by using a broader meaning for each topic. Therefore, it is somewhat expected that automatic document classification cannot achieve excellent performance with respect to both precision and recall. It is expected that the 'purer' a topic is, the better the performance achieved on the automatic document classification task will be. However, it is important to keep in mind the difficulty of the considered labeling task, together with the fact that human labeling of documents can result in ambiguous and contradictory label assignments.
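For illustration, the per-label precision and recall reported in Table 3 can be computed from the manual and predicted label sets as in the following sketch; the variable names and toy data are assumptions, not the actual evaluation code.

```python
def per_label_precision_recall(true_labels, pred_labels, label):
    """Precision and recall for one label over a multi-label corpus.
    true_labels / pred_labels: lists of sets of labels, one set per document."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if label in t and label in p)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if label not in t and label in p)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if label in t and label not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 documents labeled by humans vs. by the classifier.
manual    = [{"AMMINISTRAZIONE"}, {"SALUTE", "COMPUTER"}, {"COMPUTER"}, {"TEMPO LIBERO"}]
predicted = [{"AMMINISTRAZIONE"}, {"COMPUTER"},           {"COMPUTER"}, set()]
print(per_label_precision_recall(manual, predicted, "COMPUTER"))   # (1.0, 1.0)
```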
Table 3. Precision/Recall (%) for SALUTE (SAL.), COMPUTER (COM.), TEMPO LIBERO (TEL.) and AMMINISTRAZIONE (AMM.).

            SAL.   COM.   TEL.   AMM.
Precision    76     85     78     92
Recall       41     59     44     79
5. Conclusions and Future Work The overwhelming amount of textual data calls for efficient and effective methods to automatically summarize information and to extract valuable knowledge. Text Mining offers a rich set of computational models and algorithms to automatically extract valuable knowledge from huge amounts of semi-structured and unstructured data. In this paper a software system for topic extraction and document classification has been described. The software system assists the user in correctly discovering which main topics are mentioned in a document collection. The discovered topic structure, after being user-validated, is used to implement an automatic document classifier. This model suggests labels to be used for each new document submitted by the user. It is important to mention that each document is not restricted to receiving a single label but can be labeled with multiple topics. Furthermore, each topic, for a given document, is associated with a probability value that informs the user about how well the topic fits the considered document. This feature offers an important opportunity to the user, who can sort his/her document collection in descending order of probability for each topic. However, it must be clearly stated that many improvements can be achieved by taking into account specific user requirements. This aspect is under investigation, and particular attention is being dedicated to non-parametric models for the discovery of hierarchical topic structures. Finally, the interplay between topic taxonomies and topic extraction algorithms offers an interesting research direction to explore.
References [1] R. Feldman and J. Sanger, The Text Mining Handbook. New York: Cambridge University Press, 2007. [2] E. Wigner, “The unreasonable effectiveness of mathematics in the natural sciences,” Communications on Pure and Applied Mathematics, vol. 13, no. 1, pp. 1–14, February 1960. [3] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009. [4] M. W. Berry and M. Castellanos, Survey of Text Mining II: Clustering, Classification and Retrieval. London: Springer, 2008.
[5] T. L. Griffiths and M. Steyvers, “A probabilistic approach to semantic representation,” in Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, W. Gray and C. Schunn, Eds., 2002, pp. 381–386.
[20] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in AAAI-98 Workshop on Learning for Text Categorization. AAAI Press, 1998, pp. 41–48.
[6] T. Hofmann, “Probabilistic latent semantic analysis,” in Proc. of Uncertainty in Artificial Intelligence, UAI’99, 1999, pp. 289–296.
[21] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in ECML ’98: Proceedings of the 10th European Conference on Machine Learning. London, UK: Springer-Verlag, 1998, pp. 4–15.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, January 2003. [8] T. L. Griffiths and M. Steyvers, “Probabilistic topic models,” in Handbook of Latent Semantic Analysis, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Erlbaum, 2007.
[22] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algorithms and representations for text categorization,” in CIKM ’98: Proceedings of the seventh international conference on Information and knowledge management. New York, NY, USA: ACM Press, 1998, pp. 148–155.
[9] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=599631
[23] Y. Yang and X. Liu, “A re-examination of text categorization methods,” in SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 1999, pp. 42–49.
[10] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228–5235, April 2004.
[24] A. K. McCallum and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Machine Learning, 2000, pp. 103–134.
[11] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. New York: Springer, 2002.
[25] Y. Yang and C. G. Chute, “An example-based mapping method for text categorization and retrieval,” ACM Trans. Inf. Syst., vol. 12, no. 3, pp. 252–277, 1994.
[12] R. Cooley, “Classification of news stories using support vector machines,” in Proc. 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999.
[26] R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization,” Machine Learning, 2000, pp. 135–168.
[13] H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, and B. Scholkopf, “Text classification using string kernels,” Journal of Machine Learning Research, vol. 2, pp. 563–569, 2002.
[27] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng, “Some effective techniques for naive Bayes text classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1457–1466, 2006.
[14] L. Chen, J. Huang, and Z.-H. Gong, “An anti-noise text categorization method based on support vector machines,” in AWIC, 2005, pp. 272–278. [15] A. Lehmann and J. Shawe-Taylor, “A probabilistic model for text kernels,” in ICML ’06: Proceedings of the 23rd international conference on Machine learning. New York, NY, USA: ACM, 2006, pp. 537–544. [16] V. N. Vapnik, The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1999. [17] T. Joachims, “Transductive inference for text classification using support vector machines,” in Proceedings of ICML99, 16th International Conference on Machine Learning, I. Bratko and S. Dzeroski, Eds. Morgan Kaufmann Publishers, San Francisco, US, 1999, pp. 200–209. [18] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of Machine Learning Research, vol. 2, pp. 45–66, November 2001. [19] C. Xu and Y. Zhou, “Transductive support vector machine for personal inboxes spam categorization,” in Proceedings of the International Conference on Computational Intelligence and Security Workshops, 2007, pp. 459–463.
[28] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc., 1986.