Consider a binary classification problem, such as determining whether an image of a handwritten character is the number 2 or some other character.
If one can represent each image in a space (e.g., a feature space $F$) such that there exists a line, plane, or hyperplane that perfectly separates the two classes (2's and non-2's), then the data is linearly separable. One of the most effective methods for solving this classification problem is the Perceptron Learning Algorithm (PLA).
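The update rule at the heart of the PLA is simple to state: maintain a running guess at the weight vector $W$ and, whenever some point $x_i$ is misclassified, apply
$$ W \leftarrow W + y_i x_i, $$
which nudges the decision boundary toward classifying $x_i$ correctly. The code below implements exactly this step.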
from numpy import *
%pylab inline
d = 2 #2 dimensions
n = 100 #data points
x = random.rand(d,n) #randomly place points in 2D
#print(x)
subplot(111,aspect=1)
plot(x[0],x[1],'o')
# let's use the line y = 1/3 + (2/3)x, i.e. 0 = -1 - 2x + 3y
y = sign(dot(array([-2,+3.]),x) - 1)
#print(y)
xp = x.T[y>0].T
xm = x.T[y<0].T
subplot(111,aspect=1)
plot(xp[0],xp[1],'ro')
plot(xm[0],xm[1],'bo');
# Augment each point to the 3D vector [1, x_1, x_2] so the bias term is absorbed into W
X = empty((d+1,n))
X[0,:] = 1
X[1:,:] = x
W_true = array([-1,-2,3.])
y = sign(dot(W_true,X))
def draw_points(X):
    subplot(111,aspect=1)
    xp = X[1:,y>0] # positive class (uses the global labels y)
    xm = X[1:,y<0] # negative class
    plot(xp[0],xp[1],'ro')
    plot(xm[0],xm[1],'bo');
draw_points(X)
# let's open up a little space between the classes (a margin of width 2*halfgap)
halfgap = 0.05
X[-1,y>0] += halfgap
X[-1,y<0] -= halfgap
draw_points(X)
Our task is to identify a separating hyperplane, specified by a weight vector $W$, using just some of the known labels $y_i$, that can classify all of the points. This is a binary supervised linear classification problem, and we will solve it with the PLA.
Note that we seek to learn a map $g$ such that $$ g:\mathbb{R}^2\to\{-1,1\}.$$
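In the augmented coordinates defined above, this map takes the explicit form
$$ g(x) = \operatorname{sign}\big(W \cdot [1, x_1, x_2]\big), $$
which is exactly what sign(dot(W,X)) computes, one column of X at a time.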
# We are given X and y ONLY!!
W = array([-1,-1,5],dtype=float) # starting guess at separating W
while(True):
    misclassified = sign(dot(W,X)) != y
    if not any( misclassified ): break
    i = random.choice( arange(n)[misclassified] ) # random misclassified point
    W += y[i]*X[:,i] # PLA step
print(W)
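As a quick sanity check, we can confirm that the loop exits only once $W$ classifies every point correctly:
assert all(sign(dot(W,X)) == y) # no misclassified points remain
print('training error: '+str(mean(sign(dot(W,X)) != y))) # should print 0.0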
Let's do it again, but with graphics.
First, let's create a function that draws the current guess at $W$.
def drawline(W,color,alpha):
    # the line W[0] + W[1]*x + W[2]*y = 0 passes through
    # (0, -W[0]/W[2]) and (1, -(W[0]+W[1])/W[2])
    plot([0,1],[-W[0]/W[2],-(W[0]+W[1])/W[2]],color=color,alpha=alpha)
This time, we will package the algorithm as a function, passing the labels y explicitly so it can later be reused on just the training data.
def PLA(X,y,W0,max_counter,visualize):
    W = W0.copy() # copy so the caller's starting guess is not modified in place
    counter = 0
    while(True):
        counter += 1
        misclassified = sign(dot(W,X)) != y
        if not any( misclassified ) or counter > max_counter:
            break
        i = random.choice( nonzero(misclassified)[0] ) # random misclassified point
        W += y[i]*X[:,i] # PLA step
        if visualize==1:
            drawline(W,'k',min(.05*counter,1))
        #print(sum(misclassified))
    if visualize==1:
        drawline(W,'k',1)
        drawline(W_true,'c',1) # actual separating line
    error = sum(misclassified) # number of points still misclassified at exit
    return W,error
visualize = 1 # should we visualize?
if visualize==1:
    draw_points(X) # draw points
    xlim(0,1)
    ylim(0,1)
max_counter = 100
W0 = array([-1,-1,5],dtype=float) # starting guess at separating plane
W,error = PLA(X,y,W0,max_counter,visualize)
print('W = '+str(W),'\n', 'Error = '+str(error/n*100) +'%')
So far, we have trained the classifier on all data points and tested it on those same points. Because the data is constructed to be linearly separable (the classes are pushed apart by halfgap), the perceptron convergence theorem guarantees that the algorithm converges after finitely many updates, reaching zero classification error. In other words, we obtain perfect classification of our data.
However, we can only verify that perfect classification has been reached because we know all of the labels. This is unrealistic: in the real world, we will know the labels for some data points and not for others. How, then, can we tell whether our classification algorithm is performing well? The answer is cross-validation.
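The standard recipe is to hold out part of the labeled data, train on the rest, and estimate performance by the error rate on the held-out points:
$$ E_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i \in \text{test}} \mathbf{1}\left[\, g(x_i) \neq y_i \,\right]. $$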
Below is some code and an outline to get you started.
Define two functions: one to create points and one to draw them.
def create_data(n,d,W,halfgap):
    x = random.rand(d,n) # randomly place points in the unit square
    X = empty((d+1,n))
    X[0,:] = 1 # bias coordinate
    X[1:,:] = x
    y = sign(dot(W,X)) # label each point using the hyperplane W
    X[-1,y>0] += halfgap # open a gap between the classes
    X[-1,y<0] -= halfgap
    return X,y
# function that draws points; the first s-fraction of the data are training data, the rest are test data
def draw_points2(X,y,s,n):
    subplot(111,aspect=1)
    ntrain = int(s*n)
    Xtr, ytr = X[:,:ntrain], y[:ntrain] # training data
    Xte, yte = X[:,ntrain:], y[ntrain:] # test data
    plot(Xtr[1,ytr>0],Xtr[2,ytr>0],'ro') # train, class +1
    plot(Xtr[1,ytr<0],Xtr[2,ytr<0],'bo') # train, class -1
    plot(Xte[1,yte>0],Xte[2,yte>0],'rx') # test, class +1
    plot(Xte[1,yte<0],Xte[2,yte<0],'bx') # test, class -1
Choose parameters and create/visualize the data.
Circles are training data and x's are test data.
d = 2 #2 dimensions
n = 200 #data points
s = 0.1 # fraction training data
halfgap = 0.05
X,y = create_data(n,d,W_true,halfgap) # label the fresh data with the true hyperplane
draw_points2(X,y,s,n)
**Hint:** I'd split the data X and labels y into two parts: X_train/X_test and y_train/y_test. Then you can build a classifier using only the training data and labels, as in the sketch below.
**Hint:** Modularize your code and define functions that let you package and re-use code.
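For example, here is a minimal sketch of the suggested split and a held-out error estimate (names such as X_train and test_error are illustrative, not fixed by the exercise):
# train on the first s-fraction of the data, test on the rest
ntrain = int(s*n)
X_train, y_train = X[:,:ntrain], y[:ntrain]
X_test, y_test = X[:,ntrain:], y[ntrain:]
W0 = array([-1,-1,5],dtype=float) # starting guess, as before
W,_ = PLA(X_train,y_train,W0,max_counter,0) # train on the training data only
test_error = mean(sign(dot(W,X_test)) != y_test) # fraction of held-out points misclassified
print('test error = '+str(test_error*100)+'%')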