The PageRank algorithm (Brin and Page, 1998) was introduced by Google founders Larry Page and Sergey Brin as a method to rank webpages using only knowledge of the topology of the internet, as indicated by the network of directed hyperlinks connecting webpages. The PageRank algorithm models web surfers as random walkers who click hyperlinks uniformly at random (caution: please be careful when clicking hyperlinks!) and can be analyzed using Markov-chain theory (Durrett, 2010). See (Langville and Meyer, 2006) for an excellent introduction to PageRank and its variations. See (Gleich, 2015) for a review of additional methods and a survey of the algorithm's use across diverse applications. Finally---shameless plug, beware---see (Taylor et al., 2017) for a method extending PageRank and other eigenvector-based centrality measures to temporal and multilayer networks.
We are interested in studying random walks on a web graph. We define the adjacency matrix $A$ having entries $$A_{ij} = \left\{\begin{array}{rl}1,&(i,j)\in\mathcal{E}\\ 0,&(i,j)\not\in\mathcal{E}\end{array} \right. \nonumber $$
Letting $d_i^{out}=\sum_j A_{ij}$ denote the number of out-going edges from node $i$ to other nodes, we define the transition matrix $P$ having entries \begin{equation}\label{eq:transition_matrix} P_{ij} = A_{ij}/d_i^{out} . \end{equation}
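As a minimal sketch of Eq. \eqref{eq:transition_matrix} (the 4-node edge list here is a made-up illustration, not the dataset analyzed below), one can build $A$ and $P$ directly:
import numpy as np
# hypothetical 4-node directed graph (illustration only)
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (3, 0)]
N = 4
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = 1  # A_ij = 1 when (i,j) is an edge
d_out = A.sum(axis=1)   # out-degrees d_i^out (row sums)
P = A / d_out[:, None]  # P_ij = A_ij / d_i^out, so each row sums to 1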
Note that $\sum_j P_{ij} = 1$ for every $i\in\{1,\dots,N\}$, implying that $P{\bf 1}={\bf 1}$. That is, $\lambda_1=1$ is an eigenvalue of $P$ and ${\bf 1}=[1,1,\dots,1]^T$ is its associated right eigenvector. We will later show that this is the dominant eigenvalue and dominant right eigenvector. The PageRank vector corresponds to the dominant left eigenvector of $P$, which solves \begin{equation}\label{eq:pagerank1} {\bf x}^T P = {\bf x}^T. \nonumber \end{equation}
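Continuing the sketch above, we can numerically verify that ${\bf 1}$ is a right eigenvector with eigenvalue 1 and extract the left eigenvector solving ${\bf x}^T P = {\bf x}^T$:
print(P @ np.ones(N))  # equals the all-ones vector, so lambda = 1 is an eigenvalue
evals, evecs = np.linalg.eig(P.T)  # left eigenvectors of P are right eigenvectors of P^T
k = np.argmax(evals.real)  # locate the eigenvalue lambda = 1
x = np.abs(evecs[:, k].real)
x = x / x.sum()  # normalize the entries to sum to 1
print(x)  # the PageRank vector: x^T P = x^T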
We now formally define the PageRank vector.
Let $G(\mathcal{V},\mathcal{E})$ denote a graph with nodes $\mathcal{V}$ and edges $\mathcal{E}$, and let $P$ be its associated transition matrix for the Markov chain for an unbiased random walk on the graph. The PageRank vector ${\bf x}$ for the graph is given by the dominant left eigenvector of $P$ that solves \begin{equation}\label{eq:pagerank2} {\bf x}^T P = {\bf x}^T. \end{equation}
We now present an important theorem regarding the positivity and uniqueness of the entries $x_i$ of ${\bf x}$.
Suppose $G(\mathcal{V},\mathcal{E})$ corresponds to a strongly connected graph. Then $x_i>0$ for each $i$ and the solution ${\bf x}$ to Eq. \eqref{eq:pagerank2} is unique (up to normalization). The result follows directly from the Perron-Frobenius theorem for irreducible nonnegative matrices (Bapat, 1997).
The proof of Theorem \ref{theo:pagerank} is quite general, and holds true for any centrality matrix corresponding to a strongly connected graph---see, for example, the discussion in (Taylor et al., 2017). Alternatively, one can also prove Theorem \ref{theo:pagerank} using theory specifically for Markov chains (Durrett, 2010).
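As an informal numerical check of Theorem \ref{theo:pagerank} (a sketch on the same made-up 4-node graph as above, not a proof), power iteration from two different starting distributions converges to the same strictly positive vector:
x_a = np.ones(N) / N                  # uniform start
x_b = np.array([1.0, 0.0, 0.0, 0.0])  # concentrated start
for _ in range(200):
    x_a = P.T @ x_a  # one step of x^T P, written column-wise
    x_b = P.T @ x_b
print(x_a, x_b)  # both limits agree and every entry is positive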
from numpy import *
import networkx as nx
%pylab inline
def load_karate_club_data():
    # read the edge list from file (one "i j" pair per line)
    network_file = open("karate/karate.txt", "r")
    lines = network_file.readlines()
    edge_list = zeros((len(lines),2),dtype=int)
    for i in range(len(lines)):
        temp = lines[i].split(' ')
        edge_list[i,:] = [int(temp[0]),int(temp[1])]
    # build a dictionary of node labels {0: '0', 1: '1', ...}
    node_list = {}
    for k in range(int(1+edge_list.max())):
        node_list[k] = str(k)
    return node_list,edge_list
node_list,edge_list = load_karate_club_data()
G = nx.karate_club_graph()
pos=nx.spring_layout(G) # positions for all nodes
nx.draw_networkx_nodes(G,pos,nodelist=G.nodes,node_size=300,alpha=0.9)
nx.draw_networkx_edges(G,pos,edgelist=G.edges,edge_color='k',width=3,alpha=0.3)
nx.draw_networkx_labels(G,pos,node_list,font_size=13)
plt.axis('off')
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.title(r"Zachary Karate Club",fontsize=16, color='black')
plt.show()
This network is widely used for research because it famously split into two communities. Two of the nodes, node 0 and node 33, were the club's karate instructor and the club president, respectively. They had a disagreement (I know, I know, ... a fight broke out at the karate club) and the club split in half. For this reason, the network is often used as a test example for community detection (i.e., clustering) algorithms. A social scientist, W. W. Zachary, happened to be studying the club as a social network, and published a paper about the social dynamics in 1977 [W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.] See this paper for more background information.
from scipy.sparse import *
node_list,edge_list = load_karate_club_data()
N = len(node_list)
# convert edge list into a sparse adjacency matrix A
A = csc_matrix((ones(len(edge_list[:,0])), (edge_list[:,0], edge_list[:,1])), shape=(N, N))
A = A + A.T # make the adjacency matrix symmetric - only do this step for the Karate Club
# visualize the sparse matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.spy(A)
# construct a column-stochastic matrix B = P^T by dividing each column of A by the corresponding node degree
d = array(A.sum(axis=1)).flatten()  # sum over a row is the node's degree
B = A.copy()  # copy A so the normalization below does not overwrite it
for i in range(N):
    B[:,i] = B[:,i]/d[i]
# B is now column-stochastic (B = P^T) and represents a Markov chain
sum(B,axis=0)  # check: every column sums to 1
# now define one PageRank iteration step: x <- (1-alpha) * B x + alpha * (1/N) * 1,
# where B = P^T and alpha is the probability of teleporting to a uniformly random node
def pagerank_step(x,alpha):
    x = (1-alpha)*B.dot(x) + alpha * ones(N)/N
    return (x)
x = ones(N)/N
x2 = B.dot(x)
alpha = 0.15 # the teleportation rate (links are followed with probability 1-alpha = 0.85)
x3 = pagerank_step(x,alpha)
plot(x)
plot(x2)
plot(x3)
legend(['x1','x2','x3'])
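As a sanity check (an added sketch; the name G_mat is chosen here to avoid clobbering the graph G), each step preserves total probability, and pagerank_step is equivalent to multiplying by the dense Google matrix $(1-\alpha)P^T + (\alpha/N){\bf 1}{\bf 1}^T$:
print(sum(x3))  # total probability is preserved (should print 1.0)
G_mat = (1-alpha)*B.toarray() + alpha*ones((N,N))/N  # dense Google matrix
print(linalg.norm(G_mat.dot(x) - pagerank_step(x,alpha)))  # ~0: the two agree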
def pagerank(B,alpha,epsilon,max_iterations):
    # run pagerank_step until successive iterates differ by less than epsilon
    # (note: pagerank_step uses the global matrix B)
    x = ones(N)/N
    for k in range(max_iterations):
        x2 = pagerank_step(x,alpha)
        if linalg.norm(x-x2) < epsilon:
            return x2
        x = x2
    return x
alpha = 0.15 # the teleportation rate
epsilon = 10**-6
max_iterations = 1000
PR = pagerank(B,alpha,epsilon,max_iterations)
PR
node_ranking = argsort(-PR)
node_ranking
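To make the ranking easier to read (a small added illustration), we can print the top-ranked nodes alongside their PageRank values:
# show the five highest-ranked nodes and their PageRank scores
for r, node in enumerate(node_ranking[:5]):
    print(r+1, node_list[node], PR[node])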
G = nx.karate_club_graph()
pos=nx.spring_layout(G) # positions for all nodes
nx.draw_networkx_nodes(G,pos,nodelist=G.nodes,
    node_color=PR*10,node_size=300000*PR**2,alpha=0.9)
nx.draw_networkx_edges(G,pos,edgelist=G.edges,edge_color='k',width=3,alpha=0.3)
nx.draw_networkx_labels(G,pos,node_list,font_size=13)
plt.axis('off')
plt.rc('text', usetex=True)
plt.rc('font', family='serif')
plt.title(r"PageRank for Zachary Karate Club",fontsize=16, color='black')
plt.show()
<img src="http://www.acsu.buffalo.edu/~danet/Sp18/MTH309/MTH309_files/webgraph.jpeg" width =500>
# load the dataset
def load_california_data():
    # each line of the .dat file is either "n index url" (a node) or "e source target." (an edge)
    network_file = open("california/california.dat", "r")
    lines = network_file.readlines()
    nodes = []
    edge_list = []
    for i in range(len(lines)):
        temp = lines[i].split(' ')
        if temp[0]=='n':
            nodes.append(temp[2])
        if temp[0]=='e':
            edge_list.append([temp[1],temp[2].replace('.\n','')])
    edge_list = array(edge_list,dtype=int)  # convert the string indices to integers
    node_list = {}
    for k in range(len(nodes)):
        node_list[k] = nodes[k]
    return node_list,edge_list
from scipy.sparse import csc_matrix
node_list,edge_list = load_california_data()
N = len(node_list)
# convert edge list into a sparse adjacency matrix
A = csc_matrix((ones(len(edge_list[:,0])), (edge_list[:,0], edge_list[:,1])), shape=(N,N))
# visualize the sparse matrix
fig, ax = plt.subplots(figsize=(50, 50))
ax.spy(A)
node_list
# compute degrees: the out-degree of node i is the sum over row i of A,
# and the in-degree is the sum over column i
d_out = array(sum(A,axis=1)).flatten()
d_in = array(sum(A,axis=0)).flatten()
scatter(d_in,d_out)
xlabel('in degrees')
ylabel('out degrees')
# normalize each edge (i,j) by the out-degree of its source node i,
# then build B = P^T so that B is column-stochastic, matching pagerank_step
edge_weights = ones(len(edge_list[:,0]))
for e in range(len(edge_list[:,0])):
    i = int(edge_list[e,0])  # source node of edge e
    if d_out[i]>0:
        edge_weights[e] = edge_weights[e] / d_out[i]
B = csc_matrix((edge_weights, (edge_list[:,1], edge_list[:,0])), shape=(N,N))  # note the transposed indices: B = P^T
print(array(sum(B,axis=0)))  # column sums: 1 for nodes with out-links, 0 for dangling nodes
print(array(sum(B,axis=1)).T)
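Unlike the karate club, this web graph has dangling nodes (pages with no out-links), whose columns of B sum to 0, so each iteration step would leak probability. A common remedy, sketched here under the assumption that a dangling node's mass is redistributed uniformly (pagerank_step_dangling is a hypothetical name, not from the original notebook), is:
dangling = (d_out == 0)  # boolean mask of nodes with no out-links
def pagerank_step_dangling(x,alpha):
    # follow links, return mass stuck at dangling nodes uniformly, then teleport
    y = B.dot(x) + x[dangling].sum()*ones(N)/N
    return (1-alpha)*y + alpha*ones(N)/N
x = ones(N)/N
for k in range(100):
    x = pagerank_step_dangling(x,alpha)
print(x.sum())  # total probability is preserved (should print 1.0)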