Week 14: May 8 - 12

Computing With Text

We use Python’s text processing functions to explore the probability distributions of words in some classic texts.
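As a rough illustration of what "probability distribution of words" means here, the sketch below estimates each word's probability as its relative frequency in a text. The function name `word_probabilities`, the tokenizing regular expression, and the sample sentence are assumptions made for this example, not the code used in class.

```python
import re
from collections import Counter

def word_probabilities(text):
    """Return a dict mapping each lowercase word to its relative frequency."""
    # Assumption: a "word" is a run of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Made-up sample text; a classic text would be read from a file instead.
sample = "the cat sat on the mat and the dog sat too"
probs = word_probabilities(sample)

# Print the three most probable words.
for word, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(p, 3))
```

The probabilities sum to 1 by construction, so the result can be treated as an empirical distribution over the words that occur in the text.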

Bioinformatics

Biopython is a set of Python modules containing tools for computational molecular biology. We install Biopython and learn to use one such tool to identify a set of unknown genetic sequences.

Week 14 Notebook

Python

String formatting using the str.format method.
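A short sketch of `str.format` may help. It shows positional fields with alignment, width, and percentage specifiers, followed by named fields. The values are made up for illustration.

```python
# Made-up values: a word, its count, and the total number of words.
word, count, total = "the", 3, 12

# Positional fields: left-align in 8 columns, right-align an integer in 4,
# and show a fraction as a percentage with 2 decimals in 8 columns.
line = "{:<8}{:>4d}{:>8.2%}".format(word, count, count / total)
print(line)

# Named fields make longer templates easier to read.
named = "{w} appears {n} times".format(w=word, n=count)
print(named)
```

Fixed-width fields like these are convenient for printing tables of words and their frequencies in aligned columns.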

Class Activity: Bioinformatics

Use the tools available in Biopython to identify the origin of the six real nucleotide sequences presented in class, and determine which of the sequences presented is the fake one. The code we will use for this is shown below.

from Bio.Blast import NCBIWWW, NCBIXML

def identify_sequence(seq_data):
    """Identify a genetic sequence by querying NCBI BLAST over the network."""
    # Second (database) argument can also be "nt"
    results = NCBIWWW.qblast("blastn", "nr", seq_data, hitlist_size=2)
    records = NCBIXML.parse(results)
    E_VALUE_THRESH = 0.04  # report only matches with an expect value below this
    nshow = 95             # longest alignment printed in full
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print('****Alignment****')
                    print('sequence:', alignment.title)
                    print('length:', alignment.length)
                    print('e value:', hsp.expect)
                    if len(hsp.query) <= nshow:
                        print(hsp.query)
                        print(hsp.match)
                        print(hsp.sbjct)
                    else:
                        # Long alignment: show the first nshow-10 and last 10 characters
                        print(hsp.query[0:nshow-10] + '...' + hsp.query[-10:])
                        print(hsp.match[0:nshow-10] + '...' + hsp.match[-10:])
                        print(hsp.sbjct[0:nshow-10] + '...' + hsp.sbjct[-10:])
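The display logic at the end of identify_sequence can be tried out without a network connection. The sketch below isolates it in a helper. The name `shorten` and its `tail` parameter are hypothetical and not part of Biopython or the class code.

```python
def shorten(text, nshow=95, tail=10):
    """Return text unchanged if it has at most nshow characters; otherwise
    keep the first nshow - tail characters, then '...', then the last tail."""
    if len(text) <= nshow:
        return text
    return text[:nshow - tail] + '...' + text[-tail:]

# A 40-character string is printed in full.
print(shorten('ACGT' * 10))

# A 200-character string is cut down to its head, an ellipsis, and its tail.
print(shorten('ACGT' * 50))
```

Note that a shortened string is 85 + 3 + 10 = 98 characters long, slightly more than nshow, because the ellipsis is added on top of the characters kept.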

Report 7

Statistical Language Models