Notes on Report 3: Now due Wednesday April 25 at 11:59pm

In a nutshell: build and test a classifier for these handwritten characters, using a decision tree and the SVM algorithm that we will study today. You may not use any canned machine-learning package, like sklearn: your code must all be written from scratch except for solving the quadratic programming problem with cvxopt.

Cross validation: Take a "training" subset of the PNG images that you will use to develop a classifier, and a separate "test" subset that you will use to measure the quality of your classifier.
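
A minimal sketch of such a split, assuming the images are indexed 0..n-1 in the same order as the rows of a feature array. The name split_train_test and its parameters are hypothetical; they are not part of the assignment code.

```python
import numpy as np

def split_train_test(n_samples, train_fraction=0.75, seed=0):
    """Return two disjoint index arrays (train, test) that together cover range(n_samples)."""
    rng = np.random.default_rng(seed)      # seeded so that the split is reproducible
    order = rng.permutation(n_samples)     # shuffle so that both labels land in both subsets
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_train_test(10, train_fraction=0.7)
print(len(train_idx), len(test_idx))
```

The index arrays can then select rows of a feature array and the matching labels, for example F[train_idx] and y_true[train_idx]. Shuffling matters here because the images are loaded one character at a time, so an unshuffled split could put a single label in the test set.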

Example 1

Let's classify a subset of the handwritten characters using an SVM and a decision tree.

In [210]:
from PIL import Image
import glob
from numpy import *
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Load image names from a folder and extract the image labels from these filenames

In [211]:
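def load_images_extract_labels(images_foldername,characters_to_load):
    # collect the filenames in images_foldername that end in '_<c>.png' for each requested label c
    pngs = []
    for c in characters_to_load:
        pngs += glob.glob(images_foldername+'/*_'+c+'.png')

    # the label is the two characters just before the '.png' extension
    labels = []
    for png in pngs:
        labels.append(png[-6:-4])

    return pngs,labels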

Use the labels to create a y vector containing +1 and -1

In [215]:
images_foldername = 'pngs'
characters_to_load = ['09','_8']  # let's first classify periods ('09') and 8's ('_8')
pngs,labels = load_images_extract_labels(images_foldername,characters_to_load)
#print(labels)
#print([label=='09' for label in labels])
y_true = array([1*(label=='09') for label in labels]) + array([-1*(label=='_8') for label in labels])
print(y_true)
[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

Extract features for these images (feel free to come up with your own features :-)

In [216]:
def extract_6features(pngs):

    n = len(pngs)
    features = ['ink','width','height','topheaviness','rightheaviness','log aspect']
    d = len(features)
    F = empty((n,d))  # array of feature vectors


    for i,png in enumerate(pngs): # what is i here?
        img = Image.open(png)
        a = array(img)
        #print( a.shape )
        a = a[:,:,0]   #  select the red layer (because red, green, blue all the same)
        h,w = a.shape

        x = linspace(0,w,w,endpoint=False)
        y = linspace(0,h,h,endpoint=False)

        X,Y = meshgrid(x,y)
        #print(Y)

        #print( a.shape,a.dtype,a.max() )
        a = array(255-a,dtype=float)

        ink = a.sum()
        F[i,0] = ink/(255*w*h/5)   # can we normalize this sensibly?

        xmin = X[ a>0 ].min()  # the minimum x value where a>0
        xmax = X[ a>0 ].max()
        ymin = Y[ a>0 ].min()  # the minimum y value where a>0
        ymax = Y[ a>0 ].max()
        width  = xmax - xmin
        height = ymax - ymin
        F[i,1] = width/w   # can we normalize this sensibly?
        F[i,2] = height/h   # can we normalize this sensibly?

        xc = (xmin+xmax)/2   # center of character
        yc = (ymin+ymax)/2

        # could alternatively use center of mass
        # xc = (a*X).sum()/ink
        # yc = (a*Y).sum()/ink

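        # fraction of the ink in rows with Y > yc; row indices grow downward in
        # image arrays, so this is the ink below the center (use Y < yc for ink above)
        F[i,3] = a[ Y>yc ].sum()/ink

        # fraction of the ink to the right of center
        F[i,4] = a[ X>xc ].sum()/ink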

        # log of aspect ratio
        F[i,5] = log10(height/width)

    return features,F

In [219]:
features,F = extract_6features(pngs)
shape(F)
Out[219]:
(449, 6)

Create a function to visualize the 2D projections of these features

In [222]:
def visualize_2D_projections(F,features,pngs,label_vector):
    #colors = 'cmbgrkyw'
    #distinct_labels = list(set(label_vector))
    #label_dict = {distinct_labels[i]:i for i in range(2)}
    #print(label_dict)
    d = shape(F)[1] # number_features
    figure(figsize=(15,15))
    for i in range(d):
        for j in range(d):
            # plot the i,j coordinate plane projections
            subplot(d,d,i*d+j+1)
            if i==j: 
                text(.5,.5,features[i],ha='center')
            else:
                #for k in range(len(label_vector)):
                    #print( colors[label_dict[label_vector[k]]])
                scatter(F[:,j],F[:,i], s=2, c=label_vector, marker=(5, 0))
                #plot(F[:,j],F[:,i],'.',alpha=0.1,color=colors[label_dict[label_vector[:]]] )
                #plot([0,1],[-W[0]/W[i+1],-(W[0]+W[i+1])/W[j+1]],color='k')
            xticks([])
            yticks([])            
    return

#visualize_2D_projections(F,features,pngs,y_true)

Run these to extract and visualize data

In [224]:
images_foldername = 'pngs'
characters_to_load = ['_1','_8']

pngs,labels = load_images_extract_labels(images_foldername,characters_to_load)
features,F = extract_6features(pngs)
y_true = array([1*(label=='_1') for label in labels]) + array([-1*(label=='_8') for label in labels])

visualize_2D_projections(F,features,pngs,y_true)

Let's classify these using an SVM. That is, let's find a hyperplane in the 6-dimensional feature space that separates the datapoints of the 2 label types.

In [225]:
from cvxopt import matrix,solvers
def apply_SVM(F,y):
    
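    # hard-margin linear SVM: find W = (w0,w1,...,wd) that minimizes (w1^2+...+wd^2)/2
    # subject to y_i*dot(W,X[:,i]) >= 1 for every datapoint i
    n,d=shape(F)
    X = empty((d+1,n))   # one column per datapoint
    X[0,:] = 1           # leading 1 so that W[0] acts as the offset
    X[1:,:] = F.T 

    # cvxopt minimizes (1/2) W'PW + q'W subject to GW <= h
    P = eye(d+1)
    P[0,0] = 0           # the offset W[0] is not penalized
    q = zeros(d+1)
    G = (-X*y).T         # row i is -y_i*X[:,i]
    h = -ones(n)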

    P = matrix(P)
    q = matrix(q)
    G = matrix(G)
    h = matrix(h)
    solvers.options['show_progress'] = False
    sol=solvers.qp(P,q,G,h)

    W = array(sol['x']).reshape(d+1)

    return W,X
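
For reference, the quadratic program that apply_SVM hands to cvxopt can be written out as follows. This is a restatement of the code above, not an addition to it; the symbols $\tilde{x}_i$ (the $i$-th column of X, a feature vector with a leading 1) and $w$ (the vector W) are notation introduced here.

```latex
\begin{aligned}
\min_{w \in \mathbb{R}^{d+1}} \quad & \tfrac{1}{2}\, w^{\top} P\, w + q^{\top} w
  \;=\; \tfrac{1}{2} \sum_{j=1}^{d} w_j^{2} \\
\text{subject to} \quad & G w \le h
  \quad\Longleftrightarrow\quad
  y_i \,\bigl(w \cdot \tilde{x}_i\bigr) \ge 1, \qquad i = 1, \dots, n, \\
\text{where} \quad & P = \operatorname{diag}(0, 1, \dots, 1), \quad q = 0, \quad
  G_{i,:} = -\,y_i\, \tilde{x}_i^{\top}, \quad h_i = -1 .
\end{aligned}
```

The constraints can only be satisfied when some hyperplane puts every +1 point on one side and every -1 point on the other.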

Now let's put it all together

In [258]:
images_foldername = 'pngs'
characters_to_load = ['09','_2']

pngs,labels = load_images_extract_labels(images_foldername,characters_to_load)
features,F = extract_6features(pngs)

y_true = array([1*(label==characters_to_load[0]) for label in labels]) + array([-1*(label==characters_to_load[1]) for label in labels])
print(y_true)

visualize_2D_projections(F,features,pngs,y_true)

W,X = apply_SVM(F,y_true)
y = sign(dot(W,X))
Error = 1- sum(y==y_true)/shape(F)[0]
print(Error)
[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
0.0

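Note that the classifier sometimes fails :-( The quadratic program in apply_SVM requires every training point to lie on the correct side of the hyperplane, so it has no solution when the two classes are not linearly separable in the feature space, and cvxopt may then give an error.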

Report 3

1. Make a table describing which pairs of characters can be accurately distinguished and which can't. Keep track of the error rates for each pair of characters if they are separable.

  • Hint: Perhaps use try/except so that you can use for loops to search across all pairs of characters without being disrupted by cvxopt giving errors.

2. For a pair of characters (for example, the periods and the 2's), implement cross-validation similar to what you did in homework 4. Plot the test error as a function of the training fraction.

  • Hint: You can make this plot for just one pair of characters.

3. Using 3/4 of the data for training and 1/4 for testing, compute the hyperplane W for every pair of characters for which it can be computed. Then, for each image x in the test set (represented by its feature vector with a leading 1, like a column of X), compute sign(dot(W,x)) for all the different W values. Each of these values is +1 or -1, depending on which side of the hyperplane W the image falls on; for the pairs that include the image's true character, the sign should point to that character. Comment on what you find. See if you can classify a given image x based on the values sign(dot(W,x)) for the different W's.

  • The goal is to build a classifier based on the SVM: Given an image from the testing image set, classify the image as a particular character using hyperplanes that are trained on the training image set.
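
The try/except hint in part 1 can be sketched as a loop over all unordered pairs that records an error rate when training succeeds and None when it does not. The names tabulate_pairs, classify_pair and fake_classify_pair are hypothetical; fake_classify_pair is only a stand-in so that the sketch runs without the image files, and in practice it would be replaced by a function that loads the two characters, calls apply_SVM and returns the error rate.

```python
from itertools import combinations

def tabulate_pairs(characters, classify_pair):
    """Call classify_pair(a, b) on every unordered pair of labels.

    Stores the returned error rate, or None if the call raised an exception.
    """
    results = {}
    for a, b in combinations(characters, 2):
        try:
            results[(a, b)] = classify_pair(a, b)
        except Exception as exc:   # broad on purpose: the exact exception depends on the solver
            results[(a, b)] = None
            print(f"{a!r} vs {b!r}: failed ({exc})")
    return results

def fake_classify_pair(a, b):
    # Stand-in for the real pipeline: pretends that one pair is not separable.
    if {a, b} == {'09', '_1'}:
        raise ValueError("pretend this pair is not separable")
    return 0.0

table = tabulate_pairs(['09', '_1', '_8'], fake_classify_pair)
print(table)
```

The solver may also return without raising an exception, so it may be worth checking in the real pipeline that sol['status'] equals 'optimal' before trusting W.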

For part 3 ... be creative! There is no single correct answer. Make plots that provide insights. Explore something. Use the SVM to ask and answer your own questions about the classification of the images using a 6-dimensional feature space.