The Data

Data Source: http://www.csmining.org/index.php/spam-email-datasets-.html

The dataset contains two parts:
- TRAINING: 4327 messages out of which there are 2949 non-spam messages (HAM) and 1378 spam messagees (SPAM), all received from non-spam-trap sources.
SPAMTrain.label contains the labels of the emails, with 1 stands for a HAM and 0 stands for a SPAM.
- TESTING: 4292 messages without known class labels.
The format of the .eml file is definde in RFC822, and information on recent standard of email, i.e., MIME (Multipurpose Internet Mail Extensions) can be find in RFC2045-2049.
Since some data mining techniques only make use of the subject and body of the email to identify spam. In this package, we have included a simple python script (ExtractContent.py) which can help to extract the subject and body of the email.

In a python compatible environment, ( the code is test on python 2.5.1 and should work on python 2.x)
1, invoke the script by command ./ExtractContent.py
2, input source directory -- where you store the source files For exmaple C:\EMAILPro\CSDMC2010_SPAM\TEST
3, input destination directory -- where you want the extracted body to be For example C:\EMAILPro\CSDMC2010_SPAM\TEST_NEW
4, we are done

Note that, the script only extract limited information from the email (no information of fields like to, from, attachment are extract but only the subject and the first part of the body.) By oferring such a script we just want to show a simple preprocessing mehtod where the participants can start from. More advanced method which makes use of email header information or even attachment information are encouraged.

The Script

First, import the packages we will need. Add packages into these lines as you go and realize that you need more packages. Some of the ones we use in this model are:
- sklearn for building machine learning models
- re for parsing strings using regular expressions
- os for manipulating filepaths in our computer
- BeautifulSoup for manipulating HTML code
- pandas and numpy for data analysis
In [1]:
import os, sys, sklearn, re
import BeautifulSoup as bs
import pandas as pd
import numpy as np
In [2]:
# load the file into memory
filepath_train = "/Users/apple/Desktop/Kaggle/SpamEmail/CSDMC2010_SPAM/CSDMC2010_SPAM/TRAINING_extract"
In [4]:
# parse the file into a data frame
files = [{'file':f, 'path':os.path.join(filepath_train, f), 'content':'\n'.join(open(os.path.join(filepath_train, f)).readlines())} for f in os.listdir(filepath_train) if os.path.isfile(os.path.join(filepath_train, f))]
df_files = pd.DataFrame(files)
# take a peak at the top 5 rows to make sure we're doing it correctly
df_files.head()
Out[4]:
content file path
0 One of a kind Money maker! Try it for free!Fro... TRAIN_00000.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201...
1 link to my webcam you wanted Wanna see sexuall... TRAIN_00001.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201...
2 Re: How to manage multiple Internet connection... TRAIN_00002.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201...
3 [SPAM] Give her 3 hour rodeoEnhance your desi... TRAIN_00003.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201...
4 Best Price on the netf5f8m1 (suddenlysusan@Sto... TRAIN_00004.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201...
In [6]:
# load the labels into memory
filepath_label = "/Users/apple/Desktop/Kaggle/SpamEmail/CSDMC2010_SPAM/CSDMC2010_SPAM/SPAMTrain.label"
df_labels = pd.DataFrame.from_csv(filepath_label, sep=" ", header = None, index_col = None)
df_labels.columns = ['class','file']
# take a peak at the top 5 rows
df_labels.head()
Out[6]:
class file
0 0 TRAIN_00000.eml
1 0 TRAIN_00001.eml
2 1 TRAIN_00002.eml
3 0 TRAIN_00003.eml
4 0 TRAIN_00004.eml
In [7]:
# combine the two tables by merging on the common column, which is the "file" column
df_files_labeled = df_files.merge(df_labels)
df_files_labeled.head()
Out[7]:
content file path class
0 One of a kind Money maker! Try it for free!Fro... TRAIN_00000.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... 0
1 link to my webcam you wanted Wanna see sexuall... TRAIN_00001.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... 0
2 Re: How to manage multiple Internet connection... TRAIN_00002.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... 1
3 [SPAM] Give her 3 hour rodeoEnhance your desi... TRAIN_00003.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... 0
4 Best Price on the netf5f8m1 (suddenlysusan@Sto... TRAIN_00004.eml /Users/apple/Desktop/Kaggle/SpamEmail/CSDMC201... 0
In [19]:
#create a hash of the words in each file
file_words = []

for idx, row in df_files_labeled.iterrows():
    word_counter = {}
    content = row['content']
    parsed = re.sub(r'<[^<>]*>', '', content)
    words = [w.strip('(').strip(')').strip('"') for w in re.split(r'[!,\.\s\n]',parsed) if len(w)>0]
    for w in words:
        if w.lower() in word_counter:
            word_counter[w.lower()]+=1
        else:
            word_counter[w.lower()]=1
    file_words.append(word_counter)
In [20]:
# if a word isn't represented, make it 0 rather than NA
df_words = pd.DataFrame(file_words).fillna(0)
In [21]:
# Take only words that are represented more than 50 times throughout all documents
df_filtered = df_words[df_words.columns[df_words.sum()>50]]
In [22]:
# check at the matrix sizes look right
print df_words.shape
print df_filtered.shape
(4327, 100809)
(4327, 2724)

In [44]:
#take df_filtered.head()
Out[44]:
# #1 #cc3366 $ $1 $100 $2 $25 $250 ... z zdnet zero { | } ~/ »
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 2724 columns

Let's split the data into training and validation set
In [25]:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(df_filtered.as_matrix(), df_files_labeled['class'], test_size=0.3, random_state=0)
Gaussian Naive Bayes model
In [30]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
Calculate the accuracy score
In [31]:
sklearn.metrics.accuracy_score(y_test, y_pred)
Out[31]:
0.94457274826789839
In [32]:
 
Random forest model
In [34]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In [35]:
sklearn.metrics.accuracy_score(y_test, y_pred)
Out[35]:
0.9599692070823711
But why use one train/validation split? Let' do cross-validation
In [38]:
scores = sklearn.cross_validation.cross_val_score(clf, df_filtered.as_matrix(), df_files_labeled['class'])
scores.mean()
Out[38]:
0.96856939089948801
In []:
 
Extra Trees Classifier
In [40]:
clf = sklearn.ensemble.ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=1, random_state=0)
scores = sklearn.cross_validation.cross_val_score(clf, df_filtered.as_matrix(), df_files_labeled['class'])
scores.mean()
Out[40]:
0.97180403491083112
AdaBoost Classifier
In [42]:
clf = sklearn.ensemble.AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(clf, df_filtered.as_matrix(), df_files_labeled['class'])
scores.mean()
Out[42]:
0.97804408484020133
So 2.2% of your spam emails will come to your inbox (a bit annoying) and 2.2% of your real emails will go into the spam folder (Not acceptable!) How do we make sure that none of your important emails go into the spam folder?
Draw an ROC curve
In [62]:
clf = sklearn.ensemble.AdaBoostClassifier(n_estimators=100)
clf = clf.fit(X_train, y_train)
In [63]:
y_score = [score[0] for score in clf.predict_proba(X_test)]
In [64]:
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_test, y_score, pos_label = 0)
In [78]:
import matplotlib.pyplot as plt
%pylab inline
plt.plot(fpr, tpr)
plt.xlim(0,0.3)
plt.ylim(0.5,1)
Populating the interactive namespace from numpy and matplotlib

Out[78]:
(0.5, 1)
In [70]:
pd.DataFrame({'fpr':fpr, 'tpr':tpr, '1-tpr':1-tpr, 'thresholds':thresholds}).sort(columns=['fpr', '1-tpr']).head()
Out[70]:
1-tpr fpr thresholds tpr
23 0.518957 0 0.533494 0.481043
22 0.526066 0 0.535016 0.473934
21 0.533175 0 0.535154 0.466825
20 0.554502 0 0.538358 0.445498
19 0.559242 0 0.538463 0.440758
In [80]:
pd.DataFrame({'fpr':fpr, 'tpr':tpr, '1-tpr':1-tpr, 'thresholds':thresholds}).sort(columns=['1-tpr', 'fpr']).head()
Out[80]:
1-tpr fpr thresholds tpr
109 0 0.140251 0.474931 1
110 0 0.168757 0.470364 1
111 0 0.171038 0.469907 1
112 0 0.197263 0.467265 1
113 0 0.199544 0.467204 1
So based on this, we can build a classifier with 0% false positive rate (0% of non-spam emails will be classified as spam) and 48% true positive rate (48% of spam emails will be filtered into the spam folder) Alternatively, we can build a classifier with 100% true positive rate (all spam emails are caught) and 14% false positive rate (14% of real emails go into the spam folder)
How can we do better from here?
- try additional models - try other parameters - make better features - balance the classes
Now report the test error
In []:
filepath_train = "/Users/apple/Desktop/Kaggle/SpamEmail/CSDMC2010_SPAM/CSDMC2010_SPAM/TEST_extract"