Data Source: http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Number of Instances: 5620
Number of Attributes: 64

Relevant Information (from website):
We used preprocessing programs made available by NIST to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and a different 13 to the test set. The 32x32 bitmaps are divided into nonoverlapping blocks of 4x4, and the number of on pixels is counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.
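
To make the feature extraction concrete, here is a minimal NumPy sketch of the block counting described above. The function name `block_counts` and the toy bitmap are illustrative assumptions; this is not the NIST preprocessing code.

```python
import numpy as np

def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    bitmap = np.asarray(bitmap)
    rows, cols = bitmap.shape
    # Split each axis into (number of tiles, tile size), then sum within each tile
    tiles = bitmap.reshape(rows // block, block, cols // block, block)
    return tiles.sum(axis=(1, 3))

# Hypothetical 32x32 binary bitmap: a filled 8x8 square in the top-left corner
bitmap = np.zeros((32, 32), dtype=int)
bitmap[:8, :8] = 1

features = block_counts(bitmap)
print(features.shape)   # (8, 8)
print(features[0, 0])   # 16, a fully filled 4x4 block
```

Flattening the 8x8 result row by row would give a 64-value feature vector with entries in 0..16, which is the shape of the rows loaded below.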
In [61]:
import os, sys, sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

1. Load and look at the data

In [9]:
!ls OpticalDigits
optdigits-orig.cv.Z     optdigits-orig.tra.Z    optdigits-orig.windep.Z optdigits.tes           readme.txt
optdigits-orig.names    optdigits-orig.wdep.Z   optdigits.names         optdigits.tra

In [104]:
filepath_train = 'OpticalDigits/optdigits.tra'
filepath_test = 'OpticalDigits/optdigits.tes'
In [106]:
# The files have no header row; the first 64 columns are features and the last is the label
data_train = pd.read_csv(filepath_train, header=None)
data_test = pd.read_csv(filepath_test, header=None)
In [107]:
print("Size of training set: %d" % len(data_train))
print("Size of test set: %d" % len(data_test))
Size of training set: 3823
Size of test set: 1797

In [108]:
# Split the data into features (columns 0-63) and labels (column 64)
train_x = data_train.iloc[:, :64]
train_y = data_train.iloc[:, 64]
test_x = data_test.iloc[:, :64]
test_y = data_test.iloc[:, 64]
In [109]:
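# Verify that the data loaded correctly by plotting the first five digits
for i in range(5):
    print("Label: %d" % train_y[i])
    img = plt.figure()
    # Each row holds 64 values; reshape them back into the 8x8 image
    plt.imshow(train_x.loc[i].values.reshape((8, 8)), cmap=cm.Greys_r)
    img.set_size_inches((0.8, 0.8))
    plt.show()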
Label: 0

Label: 0

Label: 7

Label: 4

Label: 6

In [110]:
# Look at the label distribution of the training data
train_y.hist()
Out[110]:
<matplotlib.axes.AxesSubplot at 0x1081d4e50>

2. Train a simple model on the data

Nearest Neighbor (kNN, k=1)

In [111]:
# Reference: http://cs231n.github.io/classification/
import numpy as np

class NearestNeighbor:
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is N x D where each row is an example. y is a 1-D array of size N """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict label for """
    num_test = X.shape[0]
    # let's make sure that the output type matches the label type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test rows
    for i in range(num_test):
      # find the nearest training image to the i'th test image
      # using the L1 distance (sum of absolute value differences)
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      min_index = np.argmin(distances) # get the index with smallest distance
      Ypred[i] = self.ytr[min_index] # predict the label of the nearest example

    return Ypred
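
The class above always uses the single nearest neighbor. As a rough sketch of how the same L1-distance idea extends to k > 1, the function below takes a majority vote among the k closest training rows. The name `knn_predict`, the tie-breaking rule, and the toy data are assumptions made for illustration; this is not part of the referenced cs231n code.

```python
import numpy as np

def knn_predict(Xtr, ytr, X, k=3):
    """Predict labels by majority vote among the k nearest training rows (L1 distance)."""
    Xtr, ytr, X = np.asarray(Xtr), np.asarray(ytr), np.asarray(X)
    preds = np.empty(X.shape[0], dtype=ytr.dtype)
    for i in range(X.shape[0]):
        # L1 distance from the i-th query row to every training row
        distances = np.sum(np.abs(Xtr - X[i, :]), axis=1)
        # Indices of the k smallest distances (stable sort keeps earlier rows first on ties)
        nearest = np.argsort(distances, kind="mergesort")[:k]
        # Majority vote; argmax picks the smallest label when the vote is tied.
        # np.bincount assumes non-negative integer labels, as in this data set.
        preds[i] = np.bincount(ytr[nearest]).argmax()
    return preds

# Tiny hypothetical data set: two clusters in 2-D
Xtr_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
ytr_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(Xtr_toy, ytr_toy, np.array([[1, 1], [9, 9]]), k=3)
print(toy_pred)  # [0 1]
```

With k=1 this reduces to the nearest-neighbor rule used in the class above.
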
In [119]:
knn = NearestNeighbor()
knn.train(train_x.values, train_y.values)
In [140]:
test_x.head(5)
Out[140]:
0 1 2 3 4 5 6 7 8 9 ... 54 55 56 57 58 59 60 61 62 63
0 0 0 5 13 9 1 0 0 0 0 ... 0 0 0 0 6 13 10 0 0 0
1 0 0 0 12 13 5 0 0 0 0 ... 0 0 0 0 0 11 16 10 0 0
2 0 0 0 4 15 12 0 0 0 0 ... 5 0 0 0 0 3 11 16 9 0
3 0 0 7 15 13 1 0 0 0 8 ... 9 0 0 0 7 13 13 9 0 0
4 0 0 0 1 11 0 0 0 0 0 ... 0 0 0 0 0 2 16 4 0 0

5 rows × 64 columns

In [131]:
print("Predicting %d examples..." % len(test_x))
%time pred = knn.predict(test_x.values)
Predicting 1797 examples...
CPU times: user 1.54 s, sys: 11.1 ms, total: 1.55 s
Wall time: 1.55 s

In [132]:
import sklearn.metrics as metrics
print("Accuracy: %.2f" % metrics.accuracy_score(test_y, pred))
Accuracy: 0.97

In [133]:
print(metrics.classification_report(test_y, pred))
             precision    recall  f1-score   support

          0       0.99      1.00      1.00       178
          1       0.93      1.00      0.96       182
          2       0.99      0.99      0.99       177
          3       0.97      0.98      0.98       183
          4       0.98      0.98      0.98       181
          5       0.97      0.98      0.98       182
          6       0.99      0.99      0.99       181
          7       0.99      0.97      0.98       179
          8       0.98      0.91      0.95       174
          9       0.94      0.93      0.93       180

avg / total       0.97      0.97      0.97      1797


In [137]:
print(metrics.confusion_matrix(test_y, pred))
print("X-axis: predicted, Y-axis: true")
[[178   0   0   0   0   0   0   0   0   0]
 [  0 182   0   0   0   0   0   0   0   0]
 [  0   2 175   0   0   0   0   0   0   0]
 [  0   0   0 180   0   0   0   2   0   1]
 [  0   2   0   0 178   0   0   0   1   0]
 [  0   0   0   0   1 179   0   0   0   2]
 [  1   0   0   0   0   1 179   0   0   0]
 [  0   0   0   0   0   0   0 174   0   5]
 [  0  10   1   0   0   0   1   0 159   3]
 [  0   0   0   5   2   4   0   0   2 167]]
X-axis: predicted, Y-axis: true
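
The per-class figures in the classification report can be recomputed from the confusion matrix alone. A minimal NumPy sketch, with the matrix copied from the kNN output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the kNN run (rows: true label, columns: predicted label)
cm_knn = np.array([
    [178,   0,   0,   0,   0,   0,   0,   0,   0,   0],
    [  0, 182,   0,   0,   0,   0,   0,   0,   0,   0],
    [  0,   2, 175,   0,   0,   0,   0,   0,   0,   0],
    [  0,   0,   0, 180,   0,   0,   0,   2,   0,   1],
    [  0,   2,   0,   0, 178,   0,   0,   0,   1,   0],
    [  0,   0,   0,   0,   1, 179,   0,   0,   0,   2],
    [  1,   0,   0,   0,   0,   1, 179,   0,   0,   0],
    [  0,   0,   0,   0,   0,   0,   0, 174,   0,   5],
    [  0,  10,   1,   0,   0,   0,   1,   0, 159,   3],
    [  0,   0,   0,   5,   2,   4,   0,   0,   2, 167],
])

correct = np.diag(cm_knn)                               # correctly classified per digit
recall = correct / cm_knn.sum(axis=1).astype(float)     # divide by true counts (row sums)
precision = correct / cm_knn.sum(axis=0).astype(float)  # divide by predicted counts (column sums)
accuracy = correct.sum() / float(cm_knn.sum())

print("Accuracy: %.2f" % accuracy)
for digit in range(10):
    print("%d  precision %.2f  recall %.2f" % (digit, precision[digit], recall[digit]))
```

Digit 8 has the lowest recall (159 of 174), mostly because ten 8s are predicted as 1, which also pulls down the precision for digit 1.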

Support Vector Machine (SVC, RBF kernel)

In [157]:
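from sklearn import svm
# gamma='auto' is set explicitly to match the parameters printed below;
# recent scikit-learn releases default to gamma='scale' instead
clf = svm.SVC(decision_function_shape='ovo', gamma='auto')
clf.fit(train_x.values, train_y)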
Out[157]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [158]:
dec = clf.predict(test_x.values)
In [160]:
print(metrics.confusion_matrix(test_y, dec))
print("Accuracy: %.2f" % metrics.accuracy_score(test_y, dec))
[[ 99   0   0   0   0   0   0   0  79   0]
 [  0 113   0   0   0   0   0   0  69   0]
 [  0   0  80   0   0   0   0   0  97   0]
 [  0   0   0  98   0   0   0   0  85   0]
 [  0   0   0   0  88   0   0   0  93   0]
 [  0   0   0   0   0  81   0   0 101   0]
 [  0   0   0   0   0   0 108   0  73   0]
 [  0   0   0   0   0   0   0  66 113   0]
 [  0   0   0   0   0   0   0   0 174   0]
 [  0   0   0   0   0   0   0   0  77 103]]
Accuracy: 0.56
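
Every misclassified test digit is assigned to class 8, and accuracy drops to 0.56. One plausible cause, not verified in this notebook, is that gamma='auto' (1/64 for 64 features) is too large for raw features in 0..16, so the RBF kernel between two different digits is practically zero. The sketch below illustrates the effect on two hypothetical random vectors rather than on the real data; dividing by 16 is an assumed rescaling, and `rbf_kernel_value` is an illustrative helper.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two 1-D vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Two hypothetical 64-dimensional vectors with integer entries in 0..16
rng = np.random.RandomState(0)
x = rng.randint(0, 17, size=64)
z = rng.randint(0, 17, size=64)

gamma = 1.0 / 64  # what gamma='auto' resolves to for 64 features

k_raw = rbf_kernel_value(x, z, gamma)                    # raw 0..16 features
k_scaled = rbf_kernel_value(x / 16.0, z / 16.0, gamma)   # features rescaled to 0..1

print("kernel on raw features:    %.3g" % k_raw)
print("kernel on scaled features: %.3g" % k_scaled)
```

If this is the cause, rescaling the features or passing a smaller gamma to svm.SVC would be the first things to try.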
