MTH 337: Week 15¶

%pylab inline

Populating the interactive namespace from numpy and matplotlib

Word-Frequency Analysis¶

Load "Hamlet"

f = open("Hamlet.txt")
text = f.read()
f.close()

Clean up the text

text = text.lower()
for char in ':;,.!?-[]*':
    text = text.replace(char, '')

Split into words

words = text.split()

Python Dictionaries¶

We need a way to associate a number (the word count) with each word. Python dictionaries provide a way to do this. They are unordered containers of key:value pairs, identified by surrounding curly brackets, { }.

Creating a dictionary

d = {'a':1, 'b':2, 'c':3}
print d

{'a': 1, 'c': 3, 'b': 2}

Accessing a dictionary element

d['a']

1

Modifying a dictionary element

d['a'] = 10
print d

{'a': 10, 'c': 3, 'b': 2}

Counting the words

wordcount = {}
for word in words:
    if word in wordcount:
        wordcount[word] += 1
    else:
        wordcount[word] = 1

Sorting¶

Python has a built-in sorted function.

sorted sorts any iterable object (lists, strings etc) and returns a sorted list.
The key keyword specifies a function to apply to items to extract the property to sort by.
The reverse boolean keyword specifies whether to sort acending or descending.

sorted ([2,1,5,4,8])

[1, 2, 4, 5, 8]

def get_count(x):
    return x[1]

sorted_words = sorted(wordcount.items(), key=get_count, reverse=True)

Pie Charts¶

A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.

They can be created using the Matplotlib pie function.
The only required argument is a list of numbers containing the data to chart.
Some useful keyword arguments are labels and startangle.

counts = [get_count(pair) for pair in sorted_words]
labels = [pair[0] for pair in sorted_words]
figure(figsize=(6,6))
p = pie(counts, labels=labels)

Clean up the labels

labels = ['']*len(counts)
for i in range(10):
    labels[i] = sorted_words[i][0]
figure(figsize=(6,6))
p = pie(counts, labels=labels)

Frequency-Rank Plots¶

plot(counts)
xlim(0, 1000)
ylim(0, 200)

(0, 200)

figure(figsize=(6,6))
loglog(counts)

[<matplotlib.lines.Line2D at 0x7f5643a89e10>]

random.choice selects an item from a list, where each item has an equal probability of being selected.

random.choice([1, 10, 100,1000])

100

The Python string method join creates a new string from a list of strings.

''.join(['a', 'b', 'c'])

'abc'

Biopython¶

Biopython contains a set of tools for computational molecular biology. It provides:

Tools for defining DNA and protein sequences.
Tools for parsing different sequence file formats into Python data structures.
Interfaces to NCBI - National Center for Biotechnology Information.
Interfaces to BLAST - Basic Local Alignment Search Tool.

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

BLAST

Finds regions of local similarity between sequences.
Compares nucleotide or protein sequences to sequence databases.
Calculates statistical significance of matches. The longer the sequence and the more matching bases, the more significant the match.

The function identify_sequence uses BLAST to identify unknown sequences and print the results.

def identify_sequence(seq_data):
    'Identify a genetic sequence'
    results = NCBIWWW.qblast("blastn", "nt", seq_data, hitlist_size=2 )
    records = NCBIXML.parse(results)
    E_VALUE_THRESH = 0.04
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print '****Alignment****'
                    print 'sequence:', alignment.title
                    print 'length:', alignment.length
                    print 'e value:', hsp.expect
                    nshow = 95
                    if len(hsp.query)<=nshow:
                        print hsp.query
                        print hsp.match
                        print hsp.sbjct
                    else:
                        print hsp.query[0:nshow-10] + '...' + hsp.query[-10:]
                        print hsp.match[0:nshow-10] + '...' + hsp.match[-10:]
                        print hsp.sbjct[0:nshow-10] + '...' + hsp.sbjct[-10:]