Handwritten characters: feature extraction

Make sure you've downloaded and unzipped the image dataset.

As a first step, we need to map the set of images to a set of points in some space. We could treat the pixel values as the coordinates of each image, but notice that if a character is moved slightly, many pixel values can change at once. We may instead want to identify coordinates that relate to different aspects, or features, of the images.
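
To see why raw pixels make poor coordinates, here is a small sketch on a synthetic 8x8 image (not one from the dataset). Shifting a vertical stroke one pixel to the right moves the image a long way in pixel space, while a simple feature such as total ink does not change. The array `img` and the helper `shift_right` are made up for this illustration.

```python
import numpy as np

def shift_right(a, k=1):
    """Shift image columns right by k pixels, filling the left edge with zeros."""
    out = np.zeros_like(a)
    out[:, k:] = a[:, :-k]
    return out

# synthetic 8x8 "character": a vertical stroke one pixel wide in column 3
img = np.zeros((8, 8))
img[1:7, 3] = 1.0

moved = shift_right(img, 1)

# distance between the two images viewed as points in 64-dimensional pixel space
pixel_distance = np.linalg.norm(img.ravel() - moved.ravel())

# a simple feature, total ink, is identical for both images
ink_original = img.sum()
ink_moved = moved.sum()

print(pixel_distance, ink_original, ink_moved)
```

The two images differ in 12 pixels, so their distance is sqrt(12), which is larger than the norm sqrt(6) of either image on its own, even though a reader would call them the same character.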

In [9]:
from PIL import Image
import glob
from numpy import *
%pylab inline
#set_printoptions(linewidth=200)
Populating the interactive namespace from numpy and matplotlib

First, get a list of the images in the folder pngs. Let's consider only periods ('.'), 8's, and 1's.

In [ ]:
pngs = []
pngs += glob.glob('pngs/*_09.png')#periods
pngs += glob.glob('pngs/*__8.png')#8's
pngs += glob.glob('pngs/*__1.png')#1's
for png in pngs:
    print(png)

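Let's open and view one image of each class. Only one line is active at a time; swap in one of the commented-out lines to view a different image.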

In [25]:
img = Image.open(pngs[10])
#img = Image.open(pngs[400])
#img = Image.open(pngs[600])
imshow(img)
Out[25]:
<matplotlib.image.AxesImage at 0x112a090b8>

Now we need to convert each image to a numeric array (a matrix of ink intensities).

In [46]:
for png in pngs:
#    print(png)
    img = Image.open(png)
    a = array(img)
#    print( a.shape )
    a = a[:,:,0]   #  select the red layer (because red, green, blue all the same)
#    print( a.shape,a.dtype,a.max() )
    a = array(255-a,dtype=float)
#    print( a.shape,a.dtype,a.max() )
    h,w = a.shape
#     for i in range(h):
#         for j in range(w):
#             print( str( int( a[i,j]> 0) ) , end='' )
#         print()
    break
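
The same conversion can be tried without the dataset. The sketch below builds a tiny synthetic RGB array, assuming (as the cell above implies) 8-bit images with dark ink on a white background, then selects one channel and inverts it so that ink is positive and background is zero. The array `rgb` is invented for the example.

```python
import numpy as np

# synthetic 4x5 RGB image: white background (255) with two dark pixels
rgb = np.full((4, 5, 3), 255, dtype=np.uint8)
rgb[1, 2] = (0, 0, 0)        # black pixel
rgb[2, 2] = (100, 100, 100)  # gray pixel

red = rgb[:, :, 0]                     # one channel is enough for a gray image
a = np.array(255 - red, dtype=float)   # invert: ink becomes positive, background 0

print(a.shape, a.dtype, a.max())

# show which pixels carry ink, one text row per image row
for row in a:
    print(''.join(str(int(v > 0)) for v in row))
```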

Extract features

In [50]:
n = len(pngs)
features = ['ink','width','height','topheaviness','rightheaviness','log aspect']
d = len(features)
F = empty((n,d))  # array of feature vectors


for i,png in enumerate(pngs): # what is i here?
    img = Image.open(png)
    a = array(img)
    #print( a.shape )
    a = a[:,:,0]   #  select the red layer (because red, green, blue all the same)
    h,w = a.shape

    x = linspace(0,w,w,endpoint=False)
    y = linspace(0,h,h,endpoint=False)

    X,Y = meshgrid(x,y)
    #print(Y)
    
    #print( a.shape,a.dtype,a.max() )
    a = array(255-a,dtype=float)
    
    ink = a.sum()
    F[i,0] = ink/(255*w*h/5)   # can we normalize this sensibly?
    
    xmin = X[ a>0 ].min()  # the minimum x value where a>0
    xmax = X[ a>0 ].max()
    ymin = Y[ a>0 ].min()  # the minimum y value where a>0
    ymax = Y[ a>0 ].max()
    width  = xmax - xmin
    height = ymax - ymin
    F[i,1] = width/w   # can we normalize this sensibly?
    F[i,2] = height/h   # can we normalize this sensibly?
    
    xc = (xmin+xmax)/2   # center of character
    yc = (ymin+ymax)/2
    
    # could alternatively use center of mass
    # xc = (a*X).sum()/ink
    # yc = (a*Y).sum()/ink
    
    # total ink above center (row index Y grows downward, so "above" means Y < yc)
    F[i,3] = a[ Y<yc ].sum()/ink

    # total ink to the right of center
    F[i,4] = a[ X>xc ].sum()/ink

    # log of aspect ratio
    F[i,5] = log10(height/width)
    
#print(F)
[[   0.    0.    0. ...,    0.    0.    0.]
 [   1.    1.    1. ...,    1.    1.    1.]
 [   2.    2.    2. ...,    2.    2.    2.]
 ..., 
 [ 122.  122.  122. ...,  122.  122.  122.]
 [ 123.  123.  123. ...,  123.  123.  123.]
 [ 124.  124.  124. ...,  124.  124.  124.]]
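
For experimenting away from the dataset, the feature computations can be collected into one function and run on a synthetic stroke. This is only a sketch: the function name `extract_features` is hypothetical, and two choices are assumptions of the sketch rather than what the cell above does. Width and height are counted inclusively (so a one-pixel-wide stroke has width 1, which avoids dividing by zero in the aspect ratio), and ink is normalized by 255*w*h. It also assumes the usual image-array convention that the row index grows downward.

```python
import numpy as np

def extract_features(a):
    """Return (ink, width, height, top, right, log_aspect) for an ink array a.

    a is a 2-D float array with 0 for background and positive values for ink.
    """
    h, w = a.shape
    Y, X = np.mgrid[0:h, 0:w]      # Y[i, j] = i (row), X[i, j] = j (column)

    ink = a.sum()
    mask = a > 0
    xmin, xmax = X[mask].min(), X[mask].max()
    ymin, ymax = Y[mask].min(), Y[mask].max()

    # assumption of this sketch: inclusive pixel counts
    width = xmax - xmin + 1
    height = ymax - ymin + 1

    xc = (xmin + xmax) / 2         # center of the bounding box
    yc = (ymin + ymax) / 2

    top = a[Y < yc].sum() / ink    # rows above the center have a smaller row index
    right = a[X > xc].sum() / ink  # columns right of the center
    return (ink / (255 * w * h), width / w, height / h,
            top, right, np.log10(height / width))

# synthetic "1": a vertical stroke in column 4 with a small flag at the top left
a = np.zeros((10, 10))
a[2:8, 4] = 255
a[2, 3] = 255

f = extract_features(a)
print(f)
```

Here 4 of the 7 ink pixels lie above the center and 6 of 7 lie to its right, so the last three features come out as 4/7, 6/7, and log10(3).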

We cannot make a picture of dots in 6D space. However, we can make an array of all coordinate plane projections.

In [51]:
figure(figsize=(12,12))
for i in range(d):
    for j in range(d):
        # plot the i,j coordinate plane projections
        subplot(d,d,i*d+j+1)
        if i==j: 
            text(.5,.5,features[i],ha='center')
        else:
            plot(F[:,j],F[:,i],'b.',alpha=0.5)
        xticks([])
        yticks([])
            

It would be better if color indicated which points are '.', '1', and '8'.

In [54]:
c = list(set([ png[-6:-4] for png in pngs]))
# what does set do ?
print(c)
['_1', '09', '_8']
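
As a small illustration of the slice and of `set`, here is the same idea on a few made-up file names that follow the pattern of the glob calls above (the names themselves are hypothetical). Because a set has no guaranteed order, this sketch uses `sorted` to get a reproducible list, which is a variation on the cell above.

```python
# hypothetical file names following the pattern used in the glob calls
names = ['pngs/a_09.png', 'pngs/b__8.png', 'pngs/c__1.png', 'pngs/d__8.png']

# name[-6:-4] takes the two characters just before the '.png' extension
labels = [name[-6:-4] for name in names]
print(labels)            # ['09', '_8', '_1', '_8']

# set() keeps one copy of each distinct label; its iteration order is not
# guaranteed, so sorted() is used here to get a reproducible list
classes = sorted(set(labels))
print(classes)           # ['09', '_1', '_8']
```
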
In [58]:
colors = 'cmb'
colordict = {k:colors[i] for i,k in enumerate(c)}
colordict
Out[58]:
{'09': 'm', '_1': 'c', '_8': 'b'}
In [59]:
figure(figsize=(12,12))
for i in range(d):
    for j in range(d):
        # plot the i,j coordinate plane projections
        subplot(d,d,i*d+j+1)
        if i==j: 
            text(.5,.5,features[i],ha='center')
        else:
            for k,png in enumerate(pngs):
                plot(F[k,j],F[k,i],'.',alpha=0.1,color=colordict[png[-6:-4]])
        xticks([])
        yticks([])
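
Plotting one point per call can be slow when there are many images. One possible alternative, sketched here with stand-in data, is to build a boolean mask per class so that all points of a class can be drawn in a single call. The names `F_demo`, `colordict_demo`, and `masks` are invented for the example, and the random array only stands in for F.

```python
import numpy as np

# hypothetical labels and a small random feature array standing in for F
labels = np.array(['09', '_8', '_1', '_8', '09', '_1'])
F_demo = np.random.default_rng(0).random((len(labels), 6))
colordict_demo = {'09': 'm', '_1': 'c', '_8': 'b'}

# one boolean mask per class; F_demo[mask] holds all rows of that class
masks = {k: labels == k for k in colordict_demo}

for k, mask in masks.items():
    # inside the subplot loop, a single call such as
    #   plot(F_demo[mask, j], F_demo[mask, i], '.', alpha=0.1,
    #        color=colordict_demo[k])
    # would draw every point of class k at once
    print(k, colordict_demo[k], F_demo[mask].shape)
```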

Which features are best at distinguishing the characters?
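
Besides reading the plots, one rough way to put a number on this question is to compare, for each feature, how far apart the class means are relative to the spread within each class. The score below (variance of the class means divided by the mean within-class variance) is just one possible choice, not a method prescribed by this notebook, and it is run on synthetic data. The function name `separation_scores` and the array `F_demo` are hypothetical.

```python
import numpy as np

def separation_scores(F, labels):
    """Between-class variance divided by mean within-class variance, per feature."""
    classes = np.unique(labels)
    means = np.array([F[labels == k].mean(axis=0) for k in classes])
    within = np.array([F[labels == k].var(axis=0) for k in classes])
    return means.var(axis=0) / within.mean(axis=0)

# synthetic data: feature 0 separates the classes well, feature 1 does not
rng = np.random.default_rng(0)
labels = np.repeat(['09', '_1', '_8'], 50)
offsets = np.repeat([0.0, 1.0, 2.0], 50)
F_demo = np.column_stack([
    offsets + 0.1 * rng.standard_normal(150),   # class-dependent feature
    rng.standard_normal(150),                   # pure noise
])

scores = separation_scores(F_demo, labels)
print(scores)
```

A larger score suggests the feature pulls the classes apart on its own; a feature with a low score may still be useful in combination with others, which is what the pairwise plots can reveal.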