Week 14: May 8 - 12

Computing With Text

We use Python’s text processing functions to explore the probability distributions of words in some classic texts.
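As a minimal sketch of what such an exploration might look like, the snippet below estimates word probabilities as relative frequencies. The function name `word_probabilities`, the regular expression used to split words, and the sample sentence are illustrative assumptions, not the code used in class.

```python
import re
from collections import Counter

def word_probabilities(text):
    """Return a dict mapping each word to its relative frequency in text."""
    # Lowercase the text and keep runs of letters/apostrophes as words
    # (an assumed, deliberately simple tokenization).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Made-up sample text; a classic text would be read from a file instead.
sample = "the cat and the hat and the bat"
probs = word_probabilities(sample)

# Show the three most probable words.
for word, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print("{:<5} {:.3f}".format(word, p))
```

For a full book, the same function could be applied to the string returned by reading the file, and the resulting probabilities sorted or plotted.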

Bioinformatics

Biopython is a set of Python modules containing tools for computational molecular biology. We install Biopython and learn to use one such tool to identify a set of unknown genetic sequences.

Week 14 Notebook

Python

  • String formatting using format.
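A short sketch of a few common uses of `str.format` follows. The variable names and values are made up for illustration only.

```python
# Hypothetical values, chosen only to demonstrate the format specifiers.
name = "sample_1"
length = 1234
e_value = 3.2e-45

# Empty braces are filled in order; numbered braces pick arguments by position.
line1 = "sequence: {}".format(name)
line2 = "{0} has length {1:d}".format(name, length)

# A format spec after the colon controls precision and notation.
line3 = "e value: {:.2e}".format(e_value)

# '<' and '>' left- and right-align a value within a fixed width.
line4 = "{:<10}|{:>10}|".format("left", "right")

for line in (line1, line2, line3, line4):
    print(line)
```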

Class Activity: Bioinformatics

Use the tools available in Biopython to identify the origin of the six real nucleotide sequences presented in class, and determine which sequence is the fake one. The code we will use is shown below.

from Bio.Blast import NCBIWWW, NCBIXML

def identify_sequence(seq_data):
    """Identify a genetic sequence by querying NCBI's online BLAST service."""
    # Second (database) argument can also be "nt"
    results = NCBIWWW.qblast("blastn", "nr", seq_data, hitlist_size=2)
    records = NCBIXML.parse(results)
    # Report only matches whose E-value is below this threshold
    E_VALUE_THRESH = 0.04
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print('****Alignment****')
                    print('sequence:', alignment.title)
                    print('length:', alignment.length)
                    print('e value:', hsp.expect)
                    # Print alignments longer than nshow as head + '...' + tail
                    nshow = 95
                    if len(hsp.query) <= nshow:
                        print(hsp.query)
                        print(hsp.match)
                        print(hsp.sbjct)
                    else:
                        print(hsp.query[0:nshow-10] + '...' + hsp.query[-10:])
                        print(hsp.match[0:nshow-10] + '...' + hsp.match[-10:])
                        print(hsp.sbjct[0:nshow-10] + '...' + hsp.sbjct[-10:])
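
The function above applies the same slicing to the query, match, and subject strings when an alignment is too long to print in full. As a sketch, that step could be factored into a small helper and tried out without a BLAST query. The name `shorten`, its `tail` parameter, and the test sequence are hypothetical and not part of the class code.

```python
def shorten(text, nshow=95, tail=10):
    """Return text unchanged if it fits in nshow characters;
    otherwise return its first nshow - tail characters, '...',
    and its last tail characters."""
    if len(text) <= nshow:
        return text
    return text[:nshow - tail] + "..." + text[-tail:]

# Made-up 120-character sequence, used only to exercise the helper.
query = "ACGT" * 30

print(shorten("ACGT"))   # short strings are printed as they are
print(shorten(query))    # long strings are abbreviated in the middle
```

With such a helper, the three pairs of print statements in the if/else block could each be written as one call, for example `print(shorten(hsp.query))`.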