Data Source: http://www.csmining.org/index.php/spam-email-datasets-.html
The dataset contains two parts:
- TRAINING: 4327 messages, of which 2949 are non-spam (HAM) and 1378 are spam (SPAM), all received from non-spam-trap sources.
SPAMTrain.label contains the labels of the emails, where 1 stands for HAM and 0 stands for SPAM.
- TESTING: 4292 messages without known class labels.
The format of the .eml files is defined in RFC822; information on the more recent email standard, MIME (Multipurpose Internet Mail Extensions), can be found in RFC2045-2049.
Some data mining techniques use only the subject and body of an email to identify spam, so this package includes a simple Python script (ExtractContent.py) that extracts the subject and body of each email.
In a Python-compatible environment (the code was tested on Python 2.5.1 and should work on Python 2.x):
1. Invoke the script with the command ./ExtractContent.py
2. Input the source directory, where the source files are stored. For example: C:\EMAILPro\CSDMC2010_SPAM\TEST
3. Input the destination directory, where the extracted bodies should be written. For example: C:\EMAILPro\CSDMC2010_SPAM\TEST_NEW
4. Done.
Note that the script extracts only limited information from each email: fields such as To, From, and attachments are not extracted, only the subject and the first part of the body. By offering such a script we just want to show a simple preprocessing method that participants can start from. More advanced methods that make use of email header information or even attachment information are encouraged.
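For readers on Python 3, the same idea (subject plus the first text part of the body) can be sketched with the standard-library email package. This is not the bundled ExtractContent.py, which targets Python 2.x; the function name extract_subject_and_body and the sample message are made up for illustration.

```python
from email import message_from_string, policy
from typing import Tuple


def extract_subject_and_body(raw_message: str) -> Tuple[str, str]:
    """Return (subject, first text part) of an RFC 822 style message."""
    msg = message_from_string(raw_message, policy=policy.default)
    subject = str(msg.get("Subject", ""))
    # Walk the MIME tree and keep the first text/* part, mirroring the
    # README's "first part of the body" description.
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            return subject, part.get_content()
    return subject, ""


# Hypothetical message used only to exercise the function.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Body text here.\n"
)
subject, body = extract_subject_and_body(raw)
```

Headers such as From and To are available through the same msg object, which is one starting point for the more advanced methods the README encourages.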
First, import the packages we will need. Add packages to these lines as you go and realize that you need more. Some of the ones we use in this model are:
- sklearn for building machine learning models
- re for parsing strings using regular expressions
- os for manipulating filepaths in our computer
- BeautifulSoup for manipulating HTML code
- pandas and numpy for data analysis
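The notebook's import cell and loading code are not shown, so here is a minimal sketch of how several of these packages could work together to build a table with content, file and path columns. The helper names clean_text and load_directory, and the throwaway file, are assumptions made for illustration.

```python
import os
import re
import tempfile

import pandas as pd
from bs4 import BeautifulSoup


def clean_text(raw: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()


def load_directory(directory: str) -> pd.DataFrame:
    """Build one row per file with content, file and path columns."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as handle:
            rows.append({"content": clean_text(handle.read()),
                         "file": name,
                         "path": path})
    return pd.DataFrame(rows, columns=["content", "file", "path"])


# Tiny demonstration on a temporary directory (the file is made up).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "TRAIN_00000.eml"), "w", encoding="utf-8") as handle:
        handle.write("<html><body><b>Try it</b>   for free!</body></html>")
    emails = load_directory(tmp)
```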
Out[4]:
|   | content | file | path |
|---|---|---|---|
| 0 | One of a kind Money maker! Try it for free!Fro... | TRAIN_00000.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... |
| 1 | link to my webcam you wanted Wanna see sexuall... | TRAIN_00001.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... |
| 2 | Re: How to manage multiple Internet connection... | TRAIN_00002.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... |
| 3 | [SPAM] Give her 3 hour rodeoEnhance your desi... | TRAIN_00003.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... |
| 4 | Best Price on the netf5f8m1 (suddenlysusan@Sto... | TRAIN_00004.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... |
Out[6]:
|   | class | file |
|---|---|---|
| 0 | 0 | TRAIN_00000.eml |
| 1 | 0 | TRAIN_00001.eml |
| 2 | 1 | TRAIN_00002.eml |
| 3 | 0 | TRAIN_00003.eml |
| 4 | 0 | TRAIN_00004.eml |
Out[7]:
|   | content | file | path | class |
|---|---|---|---|---|
| 0 | One of a kind Money maker! Try it for free!Fro... | TRAIN_00000.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... | 0 |
| 1 | link to my webcam you wanted Wanna see sexuall... | TRAIN_00001.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... | 0 |
| 2 | Re: How to manage multiple Internet connection... | TRAIN_00002.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... | 1 |
| 3 | [SPAM] Give her 3 hour rodeoEnhance your desi... | TRAIN_00003.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... | 0 |
| 4 | Best Price on the netf5f8m1 (suddenlysusan@Sto... | TRAIN_00004.eml | /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... | 0 |
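The last table looks like the content table joined with the label table on the shared file column. A minimal sketch of such a join with pandas, using two small made-up frames in place of the real data:

```python
import pandas as pd

# Made-up stand-ins for the content table and the label table.
emails = pd.DataFrame({
    "content": ["offer text", "reply text"],
    "file": ["TRAIN_00000.eml", "TRAIN_00002.eml"],
})
labels = pd.DataFrame({
    "class": [0, 1],  # 1 = HAM, 0 = SPAM, as in SPAMTrain.label
    "file": ["TRAIN_00000.eml", "TRAIN_00002.eml"],
})

# validate="one_to_one" raises if a file name appears twice on either side,
# which guards against silently duplicated rows.
merged = emails.merge(labels, on="file", how="inner", validate="one_to_one")
```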
(4327, 100809)
(4327, 2724)
Out[44]:
|   |   | # | #1 | #cc3366 | $ | $1 | $100 | $2 | $25 | $250 | ... | z | zdnet | zero | { | \| | } | ~/ | � | � | » |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 2724 columns
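The two shapes and the term-count table above are consistent with a bag-of-words matrix whose vocabulary was pruned from 100809 to 2724 terms. The vectorizer settings are not shown, so the sketch below is only one way to get that effect: scikit-learn's CountVectorizer with a document-frequency cutoff. The whitespace token pattern, the min_df value and the four toy documents are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "win $100 now",
    "win a prize now",
    "meeting notes for monday",
    "notes on the $100 budget",
]

# token_pattern=r"\S+" keeps every whitespace-delimited token, so tokens
# such as "$100" survive (the default pattern would drop the "$").
full = CountVectorizer(token_pattern=r"\S+")
# min_df=2 keeps only tokens that occur in at least two documents.
pruned = CountVectorizer(token_pattern=r"\S+", min_df=2)

X_full = full.fit_transform(docs)      # sparse matrix, one column per token
X_pruned = pruned.fit_transform(docs)  # same rows, fewer columns
```

On this toy corpus the pruned vocabulary shrinks from 12 tokens to 4, the same kind of reduction as above at a much smaller scale.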
Let's split the data into training and validation sets
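The split cell itself is not shown. A minimal sketch with scikit-learn's train_test_split follows; the 80/20 ratio, the stratification and the synthetic data are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 5))  # synthetic term counts
y = np.array([0] * 30 + [1] * 70)      # imbalanced labels, 1 = HAM, 0 = SPAM

# stratify=y keeps the HAM/SPAM ratio roughly equal in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```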
Gaussian Naive Bayes model
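As a minimal sketch of this step, here is GaussianNB fitted on a tiny made-up count matrix. GaussianNB expects a dense array, so a sparse term-count matrix would first need .toarray().

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy counts of two hypothetical tokens; 0 = SPAM, 1 = HAM.
X_train = np.array([[3, 0], [4, 1], [5, 0], [0, 3], [1, 4], [0, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X_train, y_train)

# One new row resembling each class.
predictions = model.predict(np.array([[4, 0], [0, 4]]))
```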
Calculate the accuracy score
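A short sketch of the metric on made-up labels. The confusion matrix is included because accuracy alone does not say which class the errors fall on.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1]  # 0 = SPAM, 1 = HAM
y_pred = [0, 1, 1, 1, 1]  # one spam message predicted as ham

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
```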
But why use one train/validation split? Let's do cross-validation
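A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score. The number of folds and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two well-separated synthetic classes, 50 rows each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),
    rng.normal(3.0, 1.0, size=(50, 4)),
])
y = np.array([0] * 50 + [1] * 50)

# cv=5 fits the model five times, each time validating on a different fifth.
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
```

Averaging the per-fold scores gives an estimate that depends less on how any single split happened to fall.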
So roughly 2.2% of your spam emails will reach your inbox (a bit annoying) and roughly 2.2% of your real emails will go into the spam folder (not acceptable!). How do we make sure that none of your important emails go into the spam folder?
Out[70]:
|    | 1-tpr | fpr | thresholds | tpr |
|---|---|---|---|---|
| 23 | 0.518957 | 0 | 0.533494 | 0.481043 |
| 22 | 0.526066 | 0 | 0.535016 | 0.473934 |
| 21 | 0.533175 | 0 | 0.535154 | 0.466825 |
| 20 | 0.554502 | 0 | 0.538358 | 0.445498 |
| 19 | 0.559242 | 0 | 0.538463 | 0.440758 |
Out[80]:
|     | 1-tpr | fpr | thresholds | tpr |
|---|---|---|---|---|
| 109 | 0 | 0.140251 | 0.474931 | 1 |
| 110 | 0 | 0.168757 | 0.470364 | 1 |
| 111 | 0 | 0.171038 | 0.469907 | 1 |
| 112 | 0 | 0.197263 | 0.467265 | 1 |
| 113 | 0 | 0.199544 | 0.467204 | 1 |
So based on this, we can build a classifier with a 0% false positive rate (0% of non-spam emails will be classified as spam) and a 48% true positive rate (48% of spam emails will be filtered into the spam folder).
Alternatively, we can build a classifier with a 100% true positive rate (all spam emails are caught) and a 14% false positive rate (14% of real emails go into the spam folder).
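The two tables above have the columns you would get from scikit-learn's roc_curve. Here is a minimal sketch of building such a table and reading off the two operating points, on made-up scores. It encodes spam as the positive class (1). With this dataset's labels (1 = HAM, 0 = SPAM) you would need to pass pos_label=0 together with spam scores, or flip the labels first.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Made-up example: 1 = spam (positive class), scores = predicted P(spam).
is_spam = np.array([0, 0, 0, 0, 1, 1, 1, 1])
spam_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(is_spam, spam_score)
roc = pd.DataFrame(
    {"1-tpr": 1 - tpr, "fpr": fpr, "thresholds": thresholds, "tpr": tpr}
)

# Highest true positive rate reachable with no false positives.
no_false_alarms = roc[roc["fpr"] == 0].sort_values("tpr", ascending=False).iloc[0]
# Lowest false positive rate that still catches every spam message.
catch_everything = roc[roc["tpr"] == 1].sort_values("fpr").iloc[0]
```

Classifying a message as spam only when its score is at or above the threshold in the chosen row gives the corresponding trade-off.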
How can we do better from here?
- try additional models
- try other parameters
- make better features
- balance the classes
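As a rough sketch of the first and last ideas in the list above, the same cross-validation loop can score several candidate models, one of which reweights the classes. The candidates, their parameters and the synthetic data are assumptions chosen for illustration, not the notebook's choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
# Synthetic, imbalanced count data: 70 rows of class 1 and 30 of class 0.
X = np.vstack([
    rng.poisson([3, 1, 1], size=(70, 3)),
    rng.poisson([1, 1, 3], size=(30, 3)),
])
y = np.array([1] * 70 + [0] * 30)

candidates = {
    "gaussian_nb": GaussianNB(),
    "multinomial_nb": MultinomialNB(alpha=1.0),
    # class_weight="balanced" reweights samples inversely to class frequency.
    "logistic_balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
}

# Mean 5-fold accuracy for each candidate.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
```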
Now report the test error