%pylab inline
Load "Hamlet"
f = open("Hamlet.txt")
text = f.read()
f.close()
Clean up the text
text = text.lower()
for char in ':;,.!?-[]*':
text = text.replace(char, '')
Split into words
words = text.split()
We need a way to associate a number (the word count) with each word. Python dictionaries provide a way to do this. They are unordered containers of key:value pairs, identified by surrounding curly brackets, { }.
Creating a dictionary
d = {'a':1, 'b':2, 'c':3}
print d
Accessing a dictionary element
d['a']
Modifying a dictionary element
d['a'] = 10
print d
Counting the words
wordcount = {}
for word in words:
if word in wordcount:
wordcount[word] += 1
else:
wordcount[word] = 1
Python has a built-in sorted function.
sorted ([2,1,5,4,8])
def get_count(x):
return x[1]
sorted_words = sorted(wordcount.items(), key=get_count, reverse=True)
A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.
counts = [get_count(pair) for pair in sorted_words]
labels = [pair[0] for pair in sorted_words]
figure(figsize=(6,6))
p = pie(counts, labels=labels)
Clean up the labels
labels = ['']*len(counts)
for i in range(10):
labels[i] = sorted_words[i][0]
figure(figsize=(6,6))
p = pie(counts, labels=labels)
plot(counts)
xlim(0, 1000)
ylim(0, 200)
figure(figsize=(6,6))
loglog(counts)
random.choice selects an item from a list, where each item has an equal probability of being selected.
random.choice([1, 10, 100,1000])
The Python string method join creates a new string from a list of strings.
''.join(['a', 'b', 'c'])
Biopython contains a set of tools for computational molecular biology. It provides:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
BLAST
The function identify_sequence uses BLAST to identify unknown sequences and print the results.
def identify_sequence(seq_data):
'Identify a genetic sequence'
results = NCBIWWW.qblast("blastn", "nt", seq_data, hitlist_size=2 )
records = NCBIXML.parse(results)
E_VALUE_THRESH = 0.04
for record in records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print '****Alignment****'
print 'sequence:', alignment.title
print 'length:', alignment.length
print 'e value:', hsp.expect
nshow = 95
if len(hsp.query)<=nshow:
print hsp.query
print hsp.match
print hsp.sbjct
else:
print hsp.query[0:nshow-10] + '...' + hsp.query[-10:]
print hsp.match[0:nshow-10] + '...' + hsp.match[-10:]
print hsp.sbjct[0:nshow-10] + '...' + hsp.sbjct[-10:]