First Conference on Email and Anti-Spam (CEAS)
Filtron: A Learning-Based Anti-Spam Filter

Eirinaios Michelakis ([email protected]), Ion Androutsopoulos ([email protected]), George Paliouras ([email protected]), George Sakkis ([email protected]), Panagiotis Stamatopoulos ([email protected])

Mountain View, CA, July 30-31, 2004
Outline
- Spam Filtering: past, present and future
- Anti-spam filtering with Filtron
- In Vitro Evaluation
- In Vivo Evaluation
- Conclusions
Spam Filtering: past, present and future
Past:
- Black lists and white lists of e-mail addresses
- Handcrafted rules looking for suspicious keywords and patterns in headers

Present:
- Machine-learning-based filters
  - Mostly using the Naive Bayes classifier
  - Examples: Mozilla's spam filter, POPFile, K9
- Signature-based filtering (Vipul's Razor)

Future:
- Combination of several techniques (SpamAssassin)
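As an illustration of the Naive Bayes scoring that most current learning-based filters use, here is a minimal sketch. The token probabilities and class prior are supplied by the caller; a real filter estimates them from training counts (with smoothing). This is illustrative code, not Filtron's implementation.

```java
import java.util.Map;

// Minimal Naive Bayes spam scorer (illustrative sketch, not Filtron's code).
// tokenProbs maps each token t to {P(t|spam), P(t|legitimate)}.
public class NaiveBayesSketch {
    // Returns P(spam | tokens) via Bayes' rule, computed in log space
    // to avoid underflow on long messages.
    public static double spamProbability(String[] tokens,
                                         Map<String, double[]> tokenProbs,
                                         double priorSpam) {
        double logSpam = Math.log(priorSpam);
        double logHam  = Math.log(1.0 - priorSpam);
        for (String t : tokens) {
            double[] p = tokenProbs.get(t);
            if (p == null) continue;          // skip tokens unseen in training
            logSpam += Math.log(p[0]);        // P(t | spam)
            logHam  += Math.log(p[1]);        // P(t | legitimate)
        }
        // Normalize: P(spam|msg) = e^ls / (e^ls + e^lh)
        double max = Math.max(logSpam, logHam);
        double es = Math.exp(logSpam - max), eh = Math.exp(logHam - max);
        return es / (es + eh);
    }
}
```

The "naive" independence assumption (each token contributes an independent factor) is what makes the product decomposition above valid.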
Filtron: An overview
A multi-platform learning-based anti-spam filter.

Features for the simple user:
- Personalized: based on the user's own legitimate messages
- Automatically updated black/white lists
- Efficient: server-side filtering and interception rules

Features for the advanced user and the researcher:
- Customizable learning component
  - Through the Weka open-source machine learning platform
- Support for creating publicly available message collections
  - Privacy-preserving encoding of messages and user profiles
- Portable: implemented in Java and Tcl/Tk
  - Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)
Filtron's Architecture

[Diagram: the user's spam and legitimate mail folders feed the Filtron preprocessor; an Attribute Selector chooses the attribute set, a Vectorizer turns the messages into training vectors, and a Learner induces the classifier; the resulting user model comprises the black list, the white list, and the induced classifier.]
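The Vectorizer stage of the architecture can be sketched as follows: given the attribute set chosen by the Attribute Selector, each message becomes a fixed-length term-frequency vector. The attribute names and tokenization here are illustrative; Filtron performs attribute selection through Weka.

```java
import java.util.*;

// Sketch of the Vectorizer stage: map a message onto a fixed attribute
// set, producing a term-frequency vector (illustrative, not Filtron's code).
public class VectorizerSketch {
    public static int[] vectorize(List<String> attributes, String message) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < attributes.size(); i++) index.put(attributes.get(i), i);
        int[] vec = new int[attributes.size()];
        // Naive whitespace/punctuation tokenization for illustration.
        for (String tok : message.toLowerCase().split("\\W+")) {
            Integer i = index.get(tok);
            if (i != null) vec[i]++;          // count term frequency
        }
        return vec;
    }
}
```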
Preprocessing
1. Break down the mailbox(es) into distinct messages
2. Remove from every message:
   - mail headers
   - HTML tags
   - attached files
3. Remove messages with no textual content
4. Store at most 5 messages per sender (avoids bias towards regular correspondents)
5. Remove duplicates
6. Encode messages (optional)
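The stripping and deduplication steps above can be sketched as below. The regular expressions and the per-sender capping logic are illustrative assumptions, not Filtron's actual implementation.

```java
import java.util.*;

// Sketch of Filtron-style preprocessing (steps paraphrased from the slide).
public class PreprocessSketch {
    // Step 2 (partial): strip HTML tags from a message body.
    public static String stripHtml(String body) {
        return body.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Steps 4-5: keep at most 5 messages per sender (avoiding bias towards
    // regular correspondents) and drop messages with duplicate bodies.
    public static List<String[]> dedupeAndCap(List<String[]> messages) {
        Set<String> seenBodies = new HashSet<>();
        Map<String, Integer> perSender = new HashMap<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] m : messages) {              // m = {sender, body}
            if (!seenBodies.add(m[1])) continue;   // duplicate body: skip
            int n = perSender.merge(m[0], 1, Integer::sum);
            if (n <= 5) kept.add(m);               // cap at 5 per sender
        }
        return kept;
    }
}
```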
Message Classification

[Diagram: an incoming e-mail ("From: sender@provider ... Dear Fred, Thanks for the immediate reply. I am glad to hear... Attachments: 1. File.zip") reaches the Unix mail server, where Procmail intercepts it and hands it to Filtron; Filtron consults the user's profile (black list, address book, induced classifier) and delivers the classified e-mail to the user's mailbox.]
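The delivery decision suggested by the diagram can be sketched as below. The precedence order (address book first, then black list, then the learned classifier) is an assumption for illustration; the slide does not specify it.

```java
import java.util.Set;
import java.util.function.Predicate;

// Sketch of the per-message routing decision (assumed precedence order).
public class DecisionSketch {
    public static String route(String sender, String body,
                               Set<String> addressBook, Set<String> blackList,
                               Predicate<String> looksSpam) {
        if (addressBook.contains(sender)) return "legitimate"; // known correspondent
        if (blackList.contains(sender))   return "spam";       // known spammer
        return looksSpam.test(body) ? "spam" : "legitimate";   // induced classifier
    }
}
```

List checks are constant-time, so the (comparatively expensive) classifier only runs on senders the profile knows nothing about.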
In Vitro Evaluation

We investigated the effect of:
- Single-token versus multi-token attributes (n-grams for n = 1, 2, 3)
- Number of attributes (40-3000)
- Learning algorithm (Naive Bayes, Flexible Bayes, SVMs, LogitBoost)
- Training corpus size (~10%-100% of the full training corpus)

Cost-sensitive learning formulation:
- Misclassifying a legitimate message as spam (L→S) is λ times more serious an error than misclassifying a spam message as legitimate (S→L)
- Two usage scenarios (λ = 1 and λ = 9)
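The λ-cost formulation translates into a decision threshold and an evaluation measure, sketched below. For a probabilistic classifier, treating an L→S error as λ times worse than an S→L error means classifying as spam only when P(spam|msg) > λ/(1+λ), and the weighted accuracy counts each legitimate message λ times; both formulas follow the cited technical report's formulation.

```java
// Sketch of the cost-sensitive formulation used in the evaluation.
public class CostSensitiveSketch {
    // Classify as spam only if the error-cost-weighted odds favor it:
    // threshold is 0.5 for lambda = 1 and 0.9 for lambda = 9.
    public static boolean isSpam(double pSpam, double lambda) {
        return pSpam > lambda / (1.0 + lambda);
    }

    // Weighted accuracy: WAcc = (lambda*LL + SS) / (lambda*(LL+LS) + SS + SL),
    // where LL/LS are correctly/incorrectly classified legitimate messages
    // and SS/SL are correctly/incorrectly classified spam messages.
    public static double weightedAccuracy(long ll, long ls, long ss, long sl,
                                          double lambda) {
        return (lambda * ll + ss) / (lambda * (ll + ls) + ss + sl);
    }
}
```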
In Vitro Evaluation (cont.)

Evaluation:
- Four message collections (PU1, PU2, PU3, PUA)
- Stratified 10-fold cross-validation

Results:
- No clear winner among the learning algorithms w.r.t. accuracy ⇒ efficiency (or other criteria) more important for real usage
- Nevertheless, SVMs were consistently among the two best
- No substantial improvement with n-grams (for n > 1)

Refer to the technical report for more details: "Learning to filter unsolicited commercial e-mail", TR 2004/2, NCSR "Demokritos" (http://www.iit.demokritos.gr/skel/i-config/)
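The multi-token attributes evaluated above are simply contiguous token sequences of length 1 to n; a sketch of their generation (illustrative code):

```java
import java.util.*;

// Generate all 1..maxN-grams of a token sequence, e.g. the 1/2/3-gram
// attribute candidates evaluated in vitro.
public class NGramSketch {
    public static List<String> ngrams(String[] tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++)
            for (int i = 0; i + n <= tokens.length; i++)
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        return out;
    }
}
```

Since every (n-1)-gram of a message also appears among its n-gram candidates, moving from 1-grams to 1/2/3-grams only adds attributes; the evaluation found the additions buy little.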
Summary of in Vitro Evaluation

                             λ = 1                     λ = 9
                     Pr      Re      WAcc      Pr      Re      WAcc
  1-grams
    Naive Bayes     90.56   94.73   94.65    91.57   92.17   94.87
    Flexible Bayes  95.55   89.89   95.15    98.88   74.63   97.76
    LogitBoost      92.43   90.08   93.64    97.71   74.89   97.24
    SVM             94.95   91.43   95.42    98.12   78.33   97.60
  1/2/3-grams
    Flexible Bayes  92.98   91.89   93.89    97.43   81.36   96.91
    SVM             94.73   91.70   95.05    98.70   76.40   97.67
In Vivo Evaluation

- Seven-month live evaluation by the third author
- Training collection: PU3 (2313 legitimate / 1826 spam messages)
- Learning algorithm: SVM
- Cost scenario: λ = 1
- Retained attributes: 520 1-grams, with numeric values (term frequency)
- No black list was used
Summary of in Vivo Evaluation

  Days used                                          212
  Messages received                                  6732 (avg. 31.75 per day)
  Spam messages received                             1623 (avg. 7.66 per day)
  Legitimate messages received                       5109 (avg. 24.10 per day)
  Legitimate-to-spam ratio                           3.15
  Correctly classified legitimate messages (L→L)     5057
  Incorrectly classified legitimate messages (L→S)   52 (avg. 1.72 per week)
  Correctly classified spam messages (S→S)           1450
  Incorrectly classified spam messages (S→L)         173 (avg. 5.71 per week)

  Precision: 96.54% (PU3: 96.43%)
  Recall: 89.34% (PU3: 95.05%)
  WAcc: 96.66% (PU3: 96.22%)
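The precision and recall figures follow directly from the counts above (spam precision = SS/(SS+LS), spam recall = SS/(SS+SL)); a quick check:

```java
// Spam precision and recall from the in vivo confusion counts.
public class InVivoMetrics {
    // Of everything filed as spam, how much really was spam.
    public static double precision(long ss, long ls) {
        return (double) ss / (ss + ls);   // 1450 / (1450 + 52) = 96.54%
    }
    // Of all spam received, how much was caught.
    public static double recall(long ss, long sl) {
        return (double) ss / (ss + sl);   // 1450 / (1450 + 173) = 89.34%
    }
}
```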
Post-Mortem Analysis: False Positives

52 false positives (out of 6732 messages):
- 52%: automatically generated messages (subscription verifications, virus warnings, etc.)
- 22%: very short messages (3-5 words in the message body, along with attachments and hyperlinks)
- 26%: short messages (1-2 lines, written in the casual style often exploited by spammers, with no attachments or hyperlinks)
Post-Mortem Analysis: False Negatives

173 false negatives (out of 6732 messages):
- 30%: "hard spam" (little textual information, avoiding common suspicious word patterns; many images and hyperlinks; tricks to confuse tokenizers)
- 8%: advertisements of pornographic sites with very casual and well-chosen vocabulary
- 23%: non-English messages (under-represented in the training corpus)
- 30%: encoded messages (BASE64 format, which Filtron could not process at the time)
- 6%: hoax letters (long formal letters, e.g. "tremendous business opportunity!", with many occurrences of the receiver's full name)
- 3%: short messages with unusual content
Conclusions

- Signs of an arms race between spammers and content-based filters
- Filtron's performance was deemed satisfactory, though it can be improved with:
  - More elaborate preprocessing to tackle common spammer countermeasures (misspellings, uncommon words, text embedded in images)
  - Regular retraining
- Currently the most promising approach: combining different filtering techniques with machine learning
  - Collaborative filtering
  - Filtering at the transport layer
  - …