First Conference on Email and Anti-Spam (CEAS)

Filtron: A Learning-Based Anti-Spam Filter

Eirinaios Michelakis ([email protected]), Ion Androutsopoulos ([email protected]), George Paliouras ([email protected]), George Sakkis ([email protected]), Panagiotis Stamatopoulos ([email protected])

Mountain View, CA, July 30-31, 2004

Outline

• Spam filtering: past, present and future
• Anti-spam filtering with Filtron
• In vitro evaluation
• In vivo evaluation
• Conclusions

Spam Filtering: past, present and future

Past:
• Black lists and white lists of e-mail addresses
• Handcrafted rules looking for suspicious keywords and patterns in headers

Present:
• Machine-learning-based filters
  – Mostly using the Naïve Bayes classifier
  – Examples: Mozilla's spam filter, POPFile, K9
• Signature-based filtering (Vipul's Razor)

Future:
• Combination of several techniques (SpamAssassin)
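To make the "Present" bullet concrete: the filters named above use variants of the Naïve Bayes classifier. The following is a minimal multinomial Naïve Bayes sketch on toy data, with Laplace smoothing; it is an illustration of the general technique, not the implementation used by any of the products mentioned.

```python
import math
from collections import Counter

# Toy training data; a real filter trains on thousands of messages.
train = [
    ("win free money now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

def train_nb(messages):
    """Collect per-class token counts, class document counts and the vocabulary."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in messages:
        docs[label] += 1
        counts[label].update(text.split())
    vocab = {tok for c in counts.values() for tok in c}
    return counts, docs, vocab

def classify(text, counts, docs, vocab):
    """Return the class with the highest log-posterior (Laplace-smoothed)."""
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / sum(docs.values()))  # log prior
        total = sum(counts[label].values())
        for tok in text.split():
            if tok in vocab:  # ignore unseen tokens
                score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

counts, docs, vocab = train_nb(train)
print(classify("free money offer", counts, docs, vocab))  # → spam
```

Despite its independence assumption, this simple model is what made content-based filtering practical enough for the clients listed above.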

Filtron: An overview

• A multi-platform, learning-based anti-spam filter.
• Features for the simple user:
  – Personalized: trained on her own legitimate messages
  – Automatically updated black/white lists
  – Efficient: server-side filtering and interception rules
• Features for the advanced user and the researcher:
  – Customizable learning component, built on the Weka open-source machine learning platform
  – Support for creating publicly available message collections, with privacy-preserving encoding of messages and user profiles
• Portable: implemented in Java and Tcl/Tk
• Currently supported under POSIX-compatible mail servers (an MS Exchange Server port is under way)

Filtron's Architecture

[Training pipeline diagram] The user's spam and legitimate folders feed the Preprocessor; the Attribute Selector produces the attribute set; the Vectorizer turns messages into training vectors; and the Learner induces the classifier. The resulting user model comprises the black list, the white list, and the induced classifier.

Preprocessing

1. Break down the mailbox(es) into distinct messages.
2. Remove from every message: mail headers, HTML tags, attached files.
3. Remove messages with no textual content.
4. Store at most 5 messages per sender, to avoid bias towards regular correspondents.
5. Remove duplicates.
6. Encode messages (optional).
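Steps 1-5 above can be sketched as follows. This is a minimal illustration using Python's standard `email` module, not Filtron's actual (Java/Tcl) code; the helper names, the SHA-1 deduplication and the regex-based tag stripping are my own choices.

```python
import hashlib
import re
from email import message_from_string

MAX_PER_SENDER = 5  # cap from the slide: avoids bias towards regular correspondents

def extract_body(raw):
    """Return (sender, plain-text body): headers, HTML tags and attachments dropped."""
    msg = message_from_string(raw)
    sender = msg.get("From", "")
    parts = []
    for part in msg.walk():
        # keep only inline text/plain parts, skipping attached files
        if part.get_content_type() == "text/plain" and not part.get_filename():
            parts.append(part.get_payload())
    body = re.sub(r"<[^>]+>", " ", "\n".join(parts))  # strip stray HTML tags
    return sender, body.strip()

def preprocess(raw_messages):
    """Split, clean, drop empty messages, cap per sender, deduplicate."""
    seen_hashes = set()
    per_sender = {}
    kept = []
    for raw in raw_messages:
        sender, body = extract_body(raw)
        if not body:                                      # step 3: no textual content
            continue
        digest = hashlib.sha1(body.encode()).hexdigest()
        if digest in seen_hashes:                         # step 5: duplicate
            continue
        if per_sender.get(sender, 0) >= MAX_PER_SENDER:   # step 4: per-sender cap
            continue
        seen_hashes.add(digest)
        per_sender[sender] = per_sender.get(sender, 0) + 1
        kept.append(body)
    return kept
```

The optional encoding of step 6 (for privacy-preserving public corpora) would be applied to the surviving bodies after this pass.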

Message Classification

[Classification diagram] An incoming e-mail arrives at the Unix mail server and is handed to Procmail, which consults the black list and the address book; messages not resolved there are passed to Filtron, whose classifier applies the user's profile to classify the message before delivery to the user's mailbox.
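The delivery flow in the diagram can be sketched as a short routing function. The ordering (black list first, then the address book as a white list, then the learned classifier) is my reading of the diagram, and the folder names are hypothetical:

```python
def route(sender, body, black_list, address_book, classify):
    """Mimic the Procmail/Filtron flow: black list first, then the white
    list (address book), and only then the learned classifier."""
    if sender in black_list:
        return "spam-folder"
    if sender in address_book:       # known correspondent: skip the classifier
        return "inbox"
    return "spam-folder" if classify(body) == "spam" else "inbox"
```

Handling the list lookups in Procmail before invoking the classifier is what makes the server-side setup efficient: most mail from known correspondents never touches the learning component.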

In Vitro Evaluation

• We investigated the effect of:
  – Single-token versus multi-token attributes (n-grams, n = 1, 2, 3)
  – Number of attributes (40-3000)
  – Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
  – Training corpus size (~10%-100% of the full training corpus)
• Cost-sensitive learning formulation:
  – Misclassifying a legitimate message as spam (L→S) is λ times more serious an error than misclassifying a spam message as legitimate (S→L)
  – Two usage scenarios (λ = 1, 9)
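A sketch of how λ enters both the decision and the evaluation, assuming the weighted-accuracy formulation standard in this line of work (each legitimate message counted λ times, and a spam verdict requiring P(spam|x) > λ/(1+λ)); the exact formulas used by Filtron should be checked against the technical report cited below:

```python
def weighted_accuracy(ll, ls, ss, sl, lam):
    """WAcc = (λ·LL + SS) / (λ·(LL + LS) + SS + SL): legitimate messages
    count λ times, so an L→S error costs λ times more than an S→L error."""
    return (lam * ll + ss) / (lam * (ll + ls) + (ss + sl))

def decide(p_spam, lam):
    """Classify as spam only if P(spam|x) > λ/(1+λ); λ = 9 gives threshold 0.9."""
    return "spam" if p_spam > lam / (1 + lam) else "legitimate"

# With λ = 1 WAcc reduces to plain accuracy and the threshold to 0.5:
print(weighted_accuracy(ll=90, ls=10, ss=80, sl=20, lam=9))  # → 0.89
print(decide(0.85, 9), decide(0.85, 1))
```

Raising λ makes the filter more conservative about flagging spam, which is why the λ = 9 columns in the results table show higher precision but lower recall.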

In Vitro Evaluation (cont.)

• Evaluation:
  – Four message collections (PU1, PU2, PU3, PUA)
  – Stratified 10-fold cross-validation
• Results:
  – No clear winner among the learning algorithms w.r.t. accuracy ⇒ efficiency (or other criteria) more important for real usage
  – Nevertheless, SVMs consistently among the two best
  – No substantial improvement with n-grams (for n > 1)
• Refer to the technical report for more details: "Learning to filter unsolicited commercial e-mail", TRN 2004/2, NCSR "Demokritos" (http://www.iit.demokritos.gr/skel/i-config/)
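For concreteness, the multi-token attributes compared above are contiguous word n-grams. A minimal sketch of 1/2/3-gram extraction (the underscore joining is my own convention, not Filtron's):

```python
def ngrams(tokens, max_n=3):
    """All contiguous 1- to max_n-gram attributes of a token sequence."""
    out = []
    for n in range(1, max_n + 1):
        out += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

print(ngrams(["buy", "cheap", "pills"]))
# → ['buy', 'cheap', 'pills', 'buy_cheap', 'cheap_pills', 'buy_cheap_pills']
```

Adding bigrams and trigrams inflates the attribute space considerably, which helps explain why the attribute selector matters and why the gain over 1-grams was found to be small.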

Summary of in Vitro Evaluation

                        λ = 1                    λ = 9
                   Pr     Re     WAcc       Pr     Re     WAcc
1-grams
  Naive Bayes     90.56  94.73  94.65      91.57  92.17  94.87
  Flexible Bayes  95.55  89.89  95.15      98.88  74.63  97.76
  LogitBoost      92.43  90.08  93.64      97.71  74.89  97.24
  SVM             94.95  91.43  95.42      98.12  78.33  97.60
1/2/3-grams
  Flexible Bayes  92.98  91.89  93.89      97.43  81.36  96.91
  SVM             94.73  91.70  95.05      98.70  76.40  97.67

In Vivo Evaluation

• Seven-month live evaluation by the third author
• Training collection: PU3 (2313 legitimate / 1826 spam messages)
• Learning algorithm: SVM
• Cost scenario: λ = 1
• Retained attributes: 520 1-grams, with numeric values (term frequency)
• No black list was used

Summary of in Vivo Evaluation

Days used: 212
Messages received: 6732 (avg. 31.75 per day)
Spam messages received: 1623 (avg. 7.66 per day)
Legitimate messages received: 5109 (avg. 24.10 per day)
Legitimate-to-spam ratio: 3.15
Correctly classified legitimate messages (L→L): 5057
Incorrectly classified legitimate messages (L→S): 52 (avg. 1.72 per week)
Correctly classified spam messages (S→S): 1450
Incorrectly classified spam messages (S→L): 173 (avg. 5.71 per week)

Precision: 96.54% (PU3: 96.43%)
Recall: 89.34% (PU3: 95.05%)
WAcc: 96.66% (PU3: 96.22%)
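The three bottom-line figures follow directly from the confusion counts in the table; a quick check:

```python
# Confusion counts from the in vivo summary table.
LL, LS, SS, SL = 5057, 52, 1450, 173

precision = SS / (SS + LS)                   # fraction of spam-flagged mail that is spam
recall = SS / (SS + SL)                      # fraction of spam that was caught
wacc = (LL + SS) / (LL + LS + SS + SL)       # λ = 1, so WAcc is plain accuracy

print(f"{precision:.2%} {recall:.2%} {wacc:.2%}")  # → 96.54% 89.34% 96.66%
```

Note the precision/recall asymmetry: with λ = 1 and no black list, the filter lets 173 spam messages through but flags only 52 legitimate ones.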

Post-Mortem Analysis: False Positives

• 52 false positives (out of 6732 messages)
• 52%: automatically generated messages (subscription verifications, virus warnings, etc.)
• 22%: very short messages (3-5 words in the body), with attachments and hyperlinks
• 26%: short messages (1-2 lines), written in the casual style often exploited by spammers, with no attachments or hyperlinks

Post-Mortem Analysis: False Negatives

• 173 false negatives (out of 6732 messages)
• 30%: "hard spam"
  – Little textual information, avoiding common suspicious word patterns
  – Many images and hyperlinks
  – Tricks to confuse tokenizers
• 8%: advertisements of pornographic sites with very casual, well-chosen vocabulary
• 23%: non-English messages, under-represented in the training corpus
• 30%: encoded messages (BASE64 format; Filtron could not process it at that time)
• 6%: hoax letters
  – Long formal letters ("tremendous business opportunity!")
  – Many occurrences of the receiver's full name
• 3%: short messages with unusual content

Conclusions

• Signs of an arms race between spammers and content-based filters
• Filtron's performance was deemed satisfactory, though it can be improved with:
  – More elaborate preprocessing to tackle spammers' usual countermeasures (misspellings, uncommon words, text embedded in images)
  – Regular retraining
• Currently most promising approach: combining several filtering techniques along with machine learning
  – Collaborative filtering
  – Filtering at the transport layer
  – …
