Handwritten characters: feature extraction

Make sure you've downloaded and unzipped the image dataset.

As a first step, we need to map the set of images to a set of points in some space. We could use the pixel values as the coordinates of each image, but notice that if a character is moved slightly, the pixel values can change significantly. We may instead want coordinates that relate to different aspects, or features, of the images.
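
To see why raw pixels make poor coordinates, here is a minimal sketch on a synthetic 8x8 image (not taken from the dataset). Shifting a vertical stroke by one pixel moves the image a long way in pixel space, while a summary feature such as total ink does not change at all.

```python
import numpy as np

# Synthetic 8x8 "image": a vertical stroke in column 3 (1 = ink, 0 = background).
img = np.zeros((8, 8))
img[1:7, 3] = 1.0

# The same stroke moved one pixel to the right.
shifted = np.roll(img, 1, axis=1)

# Distance between the two images viewed as points in 64-dimensional pixel space.
pixel_distance = np.linalg.norm(img.ravel() - shifted.ravel())

# Distance between the stroke and a completely blank image, for comparison.
blank_distance = np.linalg.norm(img.ravel())

# A simple feature, total ink, is unaffected by the shift.
ink_original = img.sum()
ink_shifted = shifted.sum()

print(pixel_distance, blank_distance, ink_original, ink_shifted)
```

In this example the shifted stroke ends up farther from the original than a blank image is, even though the two strokes look nearly identical.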

from PIL import Image
import glob
from numpy import *
%pylab inline
#set_printoptions(linewidth=200)

Populating the interactive namespace from numpy and matplotlib

First, get a list of the images in the folder pngs. Let's consider only periods ('.'), 8's, and 1's.

pngs = []
pngs += glob.glob('pngs/*_09.png')#periods
pngs += glob.glob('pngs/*__8.png')#8's
pngs += glob.glob('pngs/*__1.png')#1's
for png in pngs:
    print(png)

Let's open and view one image of each class

img = Image.open(pngs[10])
#img = Image.open(pngs[400])
#img = Image.open(pngs[600])
imshow(img)

<matplotlib.image.AxesImage at 0x112a090b8>

Now we need to convert each image to a vector or matrix

for png in pngs:
#    print(png)
    img = Image.open(png)
    a = array(img)
#    print( a.shape )
    a = a[:,:,0]   #  select the red layer (because red, green, blue all the same)
#    print( a.shape,a.dtype,a.max() )
    a = array(255-a,dtype=float)
#    print( a.shape,a.dtype,a.max() )
    h,w = a.shape
#     for i in range(h):
#         for j in range(w):
#             print( str( int( a[i,j]> 0) ) , end='' )
#         print()
    break
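
The conversion above can be wrapped in a small helper. The sketch below uses a synthetic RGB array in place of `array(Image.open(png))`. The function name `to_ink` is made up for illustration, and it assumes, as above, that the three color channels are identical.

```python
import numpy as np

def to_ink(rgb):
    """Convert an (h, w, 3) uint8 image with dark ink on a white background
    into an (h, w) float array where 0 is background and 255 is full ink."""
    gray = rgb[:, :, 0]                  # channels assumed identical, so take red
    return 255.0 - gray.astype(float)    # invert: white (255) -> 0, black (0) -> 255

# Synthetic stand-in for a loaded image: white 4x5 canvas with two black pixels.
rgb = np.full((4, 5, 3), 255, dtype=np.uint8)
rgb[1, 2] = 0
rgb[2, 2] = 0

a = to_ink(rgb)
print(a.shape, a.dtype, a.max())

# Text rendering of the character, one row per line (1 = ink, 0 = background).
rows = [''.join(str(int(v > 0)) for v in row) for row in a]
print('\n'.join(rows))
```

The text rendering does the same job as the commented-out nested loop above: it is a quick way to check that the ink ended up where expected.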

Extract features

n = len(pngs)
features = ['ink','width','height','topheaviness','rightheaviness','log aspect']
d = len(features)
F = empty((n,d))  # array of feature vectors


for i,png in enumerate(pngs): # what is i here?
    img = Image.open(png)
    a = array(img)
    #print( a.shape )
    a = a[:,:,0]   #  select the red layer (because red, green, blue all the same)
    h,w = a.shape

    x = linspace(0,w,w,endpoint=False)
    y = linspace(0,h,h,endpoint=False)

    X,Y = meshgrid(x,y)
    #print(Y)
    
    #print( a.shape,a.dtype,a.max() )
    a = array(255-a,dtype=float)
    
    ink = a.sum()
    F[i,0] = ink/(255*w*h/5)   # can we normalize this sensibly?
    
    xmin = X[ a>0 ].min()  # the minimum x value where a>0
    xmax = X[ a>0 ].max()
    ymin = Y[ a>0 ].min()  # the minimum y value where a>0
    ymax = Y[ a>0 ].max()
    width  = xmax - xmin
    height = ymax - ymin
    F[i,1] = width/w   # can we normalize this sensibly?
    F[i,2] = height/h   # can we normalize this sensibly?
    
    xc = (xmin+xmax)/2   # center of character
    yc = (ymin+ymax)/2
    
    # could alternatively use center of mass
    # xc = (a*X).sum()/ink
    # yc = (a*Y).sum()/ink
    
    # total ink above center (row index Y increases downward, so "above" means Y < yc)
    F[i,3] = a[ Y<yc ].sum()/ink

    # total ink to the right of center
    F[i,4] = a[ X>xc ].sum()/ink

    # log of aspect ratio
    F[i,5] = log10(height/width)
    
#print(F)

[[   0.    0.    0. ...,    0.    0.    0.]
 [   1.    1.    1. ...,    1.    1.    1.]
 [   2.    2.    2. ...,    2.    2.    2.]
 ..., 
 [ 122.  122.  122. ...,  122.  122.  122.]
 [ 123.  123.  123. ...,  123.  123.  123.]
 [ 124.  124.  124. ...,  124.  124.  124.]]
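
For reference, the per-image computation can be collected into one function and checked on a synthetic array. This is a sketch rather than the notebook's exact code: the name `extract_features` is hypothetical, "top" is taken to mean smaller row index (row 0 is drawn at the top by `imshow`), and the normalizations copy the ones used above.

```python
import numpy as np

def extract_features(a):
    """Return [ink, width, height, topheaviness, rightheaviness, log aspect]
    for an (h, w) float array a with 0 = background and 255 = full ink."""
    h, w = a.shape
    # Y[r, c] = r (row index) and X[r, c] = c (column index)
    Y, X = np.mgrid[0:h, 0:w]

    ink = a.sum()
    mask = a > 0                                   # pixels that contain any ink
    xmin, xmax = X[mask].min(), X[mask].max()
    ymin, ymax = Y[mask].min(), Y[mask].max()
    width, height = xmax - xmin, ymax - ymin
    xc, yc = (xmin + xmax) / 2, (ymin + ymax) / 2  # center of the bounding box

    return np.array([
        ink / (255 * w * h / 5),     # ink, same normalization as above
        width / w,
        height / h,
        a[Y < yc].sum() / ink,       # fraction of ink above center (smaller row index)
        a[X > xc].sum() / ink,       # fraction of ink to the right of center
        np.log10(height / width),    # log of aspect ratio
    ])

# Synthetic 10x10 "character": a filled block in rows 2..7, columns 3..5.
a = np.zeros((10, 10))
a[2:8, 3:6] = 255.0

f = extract_features(a)
print(f)
```

Note that a stroke only one pixel wide gives a width of 0 with this definition, so the log aspect ratio is infinite. Measuring extents as `xmax - xmin + 1` would be one way around that.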

We cannot make a picture of dots in 6D space. However, we can make an array of all coordinate plane projections.

figure(figsize=(12,12))
for i in range(d):
    for j in range(d):
        # plot the i,j coordinate plane projections
        subplot(d,d,i*d+j+1)
        if i==j: 
            text(.5,.5,features[i],ha='center')
        else:
            plot(F[:,j],F[:,i],'b.',alpha=0.5)
        xticks([])
        yticks([])
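
The expression `i*d+j+1` in the loop converts a (row, column) pair into the 1-based, row-major position that `subplot` expects. A small sketch with `d = 3` (chosen only to keep the output short) makes the mapping explicit.

```python
d = 3  # small grid for illustration; the notebook uses d = 6

# Map each (row i, column j) pair to its subplot position.
index = {(i, j): i * d + j + 1 for i in range(d) for j in range(d)}

for i in range(d):
    print([index[(i, j)] for j in range(d)])
```

Positions run left to right along each row and then continue on the next row, so the diagonal cells (where `i == j`) are the ones that hold the feature names.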

It would be better if color indicated which points are '.', '1', and '8'

c = list(set([ png[-6:-4] for png in pngs]))
# what does set do ?
print(c)

['_1', '09', '_8']

colors = 'cmb'
colordict = {k:colors[i] for i,k in enumerate(c)}
colordict

{'09': 'm', '_1': 'c', '_8': 'b'}
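
One caveat with `set`: the order of its elements is not fixed, so the class-to-color assignment printed above may differ from one session to the next. A small sketch that sorts the labels first keeps the mapping stable. The file names here are made up for illustration.

```python
# Hypothetical file names following the pattern used above.
pngs = ['pngs/a__1.png', 'pngs/b_09.png', 'pngs/c__8.png', 'pngs/d__1.png']

# The two characters before '.png' identify the class.
labels = [png[-6:-4] for png in pngs]

# sorted() turns the set into a list with a reproducible order.
classes = sorted(set(labels))

colors = 'cmb'
colordict = {k: colors[i] for i, k in enumerate(classes)}
print(colordict)
```

With sorted labels, the same class gets the same color every time the notebook is run, which makes plots from different sessions easier to compare.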

figure(figsize=(12,12))
for i in range(d):
    for j in range(d):
        # plot the i,j coordinate plane projections
        subplot(d,d,i*d+j+1)
        if i==j: 
            text(.5,.5,features[i],ha='center')
        else:
            for k,png in enumerate(pngs):
                plot(F[k,j],F[k,i],'.',alpha=0.1,color=colordict[png[-6:-4]])
        xticks([])
        yticks([])
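
Calling `plot` once per point is slow when there are many images. One alternative, sketched here with made-up data, is to build a boolean mask per class so that each class can be drawn with a single call such as `plot(F[mask,j], F[mask,i], '.', color=colordict[k])`. The array and labels below are placeholders, not the real features.

```python
import numpy as np

# Placeholder feature matrix (5 images, 2 features) and class labels.
F = np.array([[0.1, 0.9],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.6, 0.4],
              [0.5, 0.5]])
labels = np.array(['09', '09', '_1', '_1', '_8'])

# One boolean mask per class, selecting that class's rows of F.
masks = {k: labels == k for k in sorted(set(labels.tolist()))}

for k, mask in masks.items():
    print(k, F[mask].shape)
```

In the notebook, `labels` could be built as `array([png[-6:-4] for png in pngs])`, so that row `k` of `F` and entry `k` of `labels` refer to the same image.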

Which features are best at distinguishing the characters?