MTH 337: Week 15

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

Word-Frequency Analysis

Load "Hamlet"

In [2]:
f = open("Hamlet.txt")
text = f.read()
f.close()

Clean up the text

In [3]:
text = text.lower()
for char in ':;,.!?-[]*':
    text = text.replace(char, '')

Split into words

In [4]:
words = text.split()

Python Dictionaries

We need a way to associate a number (the word count) with each word. Python dictionaries provide a way to do this. They are unordered containers of key:value pairs, identified by surrounding curly brackets, { }.

Creating a dictionary

In [5]:
d = {'a':1, 'b':2, 'c':3}
print d
{'a': 1, 'c': 3, 'b': 2}

Accessing a dictionary element

In [6]:
d['a']
Out[6]:
1

Modifying a dictionary element

In [7]:
d['a'] = 10
print d
{'a': 10, 'c': 3, 'b': 2}

Counting the words

In [8]:
wordcount = {}
for word in words:
    if word in wordcount:
        wordcount[word] += 1
    else:
        wordcount[word] = 1

Sorting

Python has a built-in sorted function.

  • sorted sorts any iterable object (lists, strings etc) and returns a sorted list.
  • The key keyword specifies a function to apply to items to extract the property to sort by.
  • The reverse boolean keyword specifies whether to sort acending or descending.
In [9]:
sorted ([2,1,5,4,8])
Out[9]:
[1, 2, 4, 5, 8]
In [10]:
def get_count(x):
    return x[1]
In [11]:
sorted_words = sorted(wordcount.items(), key=get_count, reverse=True)

Pie Charts

A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.

  • They can be created using the Matplotlib pie function.
  • The only required argument is a list of numbers containing the data to chart.
  • Some useful keyword arguments are labels and startangle.
In [12]:
counts = [get_count(pair) for pair in sorted_words]
labels = [pair[0] for pair in sorted_words]
figure(figsize=(6,6))
p = pie(counts, labels=labels)

Clean up the labels

In [13]:
labels = ['']*len(counts)
for i in range(10):
    labels[i] = sorted_words[i][0]
figure(figsize=(6,6))
p = pie(counts, labels=labels)

Frequency-Rank Plots

In [14]:
plot(counts)
xlim(0, 1000)
ylim(0, 200)
Out[14]:
(0, 200)
In [15]:
figure(figsize=(6,6))
loglog(counts)
Out[15]:
[<matplotlib.lines.Line2D at 0x7f5643a89e10>]

random.choice selects an item from a list, where each item has an equal probability of being selected.

In [16]:
random.choice([1, 10, 100,1000])
Out[16]:
100

The Python string method join creates a new string from a list of strings.

In [17]:
''.join(['a', 'b', 'c'])
Out[17]:
'abc'

Biopython

Biopython contains a set of tools for computational molecular biology. It provides:

  • Tools for defining DNA and protein sequences.
  • Tools for parsing different sequence file formats into Python data structures.
  • Interfaces to NCBI - National Center for Biotechnology Information.
  • Interfaces to BLAST - Basic Local Alignment Search Tool.
In [18]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

BLAST

  • Finds regions of local similarity between sequences.
  • Compares nucleotide or protein sequences to sequence databases.
  • Calculates statistical significance of matches. The longer the sequence and the more matching bases, the more significant the match.

The function identify_sequence uses BLAST to identify unknown sequences and print the results.

In [19]:
def identify_sequence(seq_data):
    'Identify a genetic sequence'
    results = NCBIWWW.qblast("blastn", "nt", seq_data, hitlist_size=2 )
    records = NCBIXML.parse(results)
    E_VALUE_THRESH = 0.04
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print '****Alignment****'
                    print 'sequence:', alignment.title
                    print 'length:', alignment.length
                    print 'e value:', hsp.expect
                    nshow = 95
                    if len(hsp.query)<=nshow:
                        print hsp.query
                        print hsp.match
                        print hsp.sbjct
                    else:
                        print hsp.query[0:nshow-10] + '...' + hsp.query[-10:]
                        print hsp.match[0:nshow-10] + '...' + hsp.match[-10:]
                        print hsp.sbjct[0:nshow-10] + '...' + hsp.sbjct[-10:]