Python Review

BCH 519
Spring 2021

Andrew E. Bruno
aebruno2@buffalo.edu

Topics Covered

  • Tips & Tricks
  • Better CLI argument parsing
  • Parsing data (list of dictionaries)
  • Exercise Review

Python Interactive Mode

  • Read-Eval-Print Loop (REPL)
  • Useful for quick debugging or testing code
$ python3
Python 3.7 (default, Sep 16 2015, 09:25:04)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> the_world_is_flat = True
>>> if the_world_is_flat:
...     print("Be careful not to fall off!")
...
Be careful not to fall off!
https://docs.python.org/3/tutorial/interpreter.html#interactive-mode

Python Virtual Environments

  • Each virtual environment has its own Python binary
  • Allows you to install packages in a local target dir
  • Independent set of installed packages
  • Can use pip to install packages
$ python3 -m venv /path/to/venv
$ source /path/to/venv/bin/activate
$ pip install numpy
$ python3
>>> import numpy
>>> numpy.__version__
'1.16.1'
https://docs.python.org/3/library/venv.html
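
To confirm from inside Python that a virtual environment is active, one option is to compare sys.prefix with sys.base_prefix. This is a minimal sketch; the helper name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter in use:", sys.executable)
```

Run it before and after activating the environment; the first line of output should change from False to True.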

Parsing command line arguments

  • How to write a program that uses CLI flags?
  • Use the argparse module from the standard library
import argparse

parser = argparse.ArgumentParser(description="My program")
parser.add_argument("--verbose", action="store_true")
parser.add_argument("--cutoff", help="The cutoff value")
parser.add_argument("--input", help="Path to input file")

args = parser.parse_args()
if args.verbose:
    print("Verbose option set")
$ python3 cli-advanced.py --verbose
Verbose option set
https://docs.python.org/3/library/argparse.html
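
By default argparse stores every option value as a string. The sketch below shows one way to have --cutoff converted to a float and to make --input mandatory. The default of 0.5, the file name genes.txt, and the helper name build_parser are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="My program")
    parser.add_argument("--verbose", action="store_true", help="Print extra output")
    # type=float converts the command line string; 0.5 is an assumed default
    parser.add_argument("--cutoff", type=float, default=0.5, help="The cutoff value")
    # required=True makes argparse exit with an error if --input is missing
    parser.add_argument("--input", required=True, help="Path to input file")
    return parser


# Passing a list to parse_args() lets us try the parser without a real command line
args = build_parser().parse_args(["--input", "genes.txt", "--cutoff", "0.25"])
print(args.input, args.cutoff, args.verbose)
```

In a real script, call parse_args() with no arguments so that it reads sys.argv.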

Parsing data using data structures

  • Recall typical I/O scenario
    • Get command line arguments
    • Open files for reading and/or writing
    • Read data and process
    • Write output, close files
  • Data from file can be stored in a list of dictionaries
  • The list of data “records” can then be processed
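import argparse

# Parse the command line; --input is required because the script needs a file
parser = argparse.ArgumentParser(description="Parse data")
parser.add_argument("--input", required=True, help="Path to input file")
args = parser.parse_args()

# Read each tab-separated line into a dictionary (a "record")
data = []
with open(args.input, "r") as fin:
    for line in fin:
        cols = line.strip().split("\t")
        record = {
            "name": cols[1],   # second column
            "chrom": cols[2],  # third column
        }
        data.append(record)

# Process the records: print the name field of each one
# (the header row is stored as a record too, so "name" is printed first)
for rec in data:
    print(rec["name"])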

Parsing data using data structures

Test with refseq-genes.txt from HW 1

$ python3 parse-data.py --input refseq-genes.txt
name
NM_032291
NM_052998
NM_001080397
NM_013943
NM_032785
NM_018090
NM_001145278
NM_001145277
NM_001918
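
The first line of output above is the header row, which the script stored as a record like any other line. Below is a minimal sketch of skipping the header and then summarizing the records. The sample rows, the column layout (name in the second column, chrom in the third), and the helper name read_records are hypothetical stand-ins, not the contents of refseq-genes.txt.

```python
import io
from collections import Counter

# Hypothetical tab-separated data; a real run would open the input file instead
text = (
    "bin\tname\tchrom\n"
    "1\tGENE_A\tchr1\n"
    "2\tGENE_B\tchr2\n"
    "3\tGENE_C\tchr1\n"
)


def read_records(handle):
    next(handle, None)  # skip the header row (does nothing if the file is empty)
    records = []
    for line in handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # ignore blank or short lines
        records.append({"name": cols[1], "chrom": cols[2]})
    return records


data = read_records(io.StringIO(text))

# Process the list of records: count how many genes fall on each chromosome
counts = Counter(rec["chrom"] for rec in data)
for chrom, n in sorted(counts.items()):
    print(chrom, n)
```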

Reading and Writing Files

Exercises and Solutions

Splitting genomic DNA

Here’s a short section of genomic DNA:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATC
GATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT

It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will split the genomic DNA into coding and non-coding parts, and write these sequences to two separate files.

Solution 1

seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCAT'
seq += 'GTAGCTACTCGATCGATCGATCGATCGATCGATCGATCG'
seq += 'ATCGATCATGCTATCATCGATCGATATCGATGCATCGAC'
seq += 'TACTAT'

exon1 = seq[0:63]    # characters 1-63 (the slice end index is exclusive)
exon2 = seq[90:]     # character 91 to the end
intron = seq[63:90]  # characters 64-90

with open('coding.txt', 'w') as coding:
    coding.write(exon1.upper() + exon2.upper() + '\n')

with open('non-coding.txt', 'w') as non_coding:
    non_coding.write(intron.lower() + '\n')
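
Slice boundaries are easy to get off by one, because the exercise counts characters from 1 while Python slices count from 0 and exclude the end index. A quick sanity check, sketched below, is that the three pieces join back into the original sequence. The function name split_gene and its parameter names are made up for illustration.

```python
def split_gene(seq, exon1_end=63, exon2_start=91):
    # Positions are 1-based character numbers, as in the exercise text
    exon1 = seq[:exon1_end]                  # characters 1..63  -> indices 0..62
    intron = seq[exon1_end:exon2_start - 1]  # characters 64..90 -> indices 63..89
    exon2 = seq[exon2_start - 1:]            # characters 91..end -> indices 90..
    return exon1, intron, exon2


seq = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCAT'
seq += 'GTAGCTACTCGATCGATCGATCGATCGATCGATCGATCG'
seq += 'ATCGATCATGCTATCATCGATCGATATCGATGCATCGAC'
seq += 'TACTAT'

exon1, intron, exon2 = split_gene(seq)

# No character may be lost or duplicated by the split
assert exon1 + intron + exon2 == seq
print(len(exon1), len(intron), len(exon2))
```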

Lists and Loops

Exercises and Solutions

Processing DNA in a file

The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment - a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.

Solution 1

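with open('input.txt') as fin, open('output.txt', 'w') as fout:
    for line in fin:
        # Strip the newline, then the 14-base adapter
        dna = line.strip()[14:]

        # Write the trimmed sequence to the file
        fout.write(dna + '\n')

        # Print the sequence length to the screen (newline not counted)
        print(len(dna))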
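
Each line read from a file still ends with a newline character, which would be counted by len() unless it is stripped first. The sketch below isolates the trimming step in a small function so it can be tried without a file. The adapter and read sequences are made up, and trim_adapter is a hypothetical helper name.

```python
ADAPTER_LENGTH = 14


def trim_adapter(line, adapter_length=ADAPTER_LENGTH):
    # strip() removes the trailing newline so it is not counted in the length
    return line.strip()[adapter_length:]


# Hypothetical input lines: a made-up 14-base adapter followed by a short read
adapter = "ATTCGATTATAAGC"
lines = [adapter + "ACTGATCG\n", adapter + "GGTTAACCGG\n"]

for line in lines:
    dna = trim_adapter(line)
    print(dna, len(dna))
```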

Multiple exons from genomic DNA

The file genomic_dna.txt contains a section of genomic DNA, and the file exons.txt contains a list of start/stop positions of exons. Each exon is on a separate line and the start and stop positions are separated by a comma. Write a program that will extract the exon segments, concatenate them, and write them to a new file.

Note: You can assume the start/stop positions are 0-based, so they can be used directly as Python slice indices

Solution 1

dna = ''
with open('genomic_dna.txt') as fin:
    for line in fin:
        dna += line.strip()

exons = ''
with open('exons.txt') as fin:
    for line in fin:
        start, stop = line.strip().split(',')

        exons += dna[int(start):int(stop)]

with open('output.txt', 'w') as fout:
    fout.write(exons + '\n')
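
The extraction step can also be written as a function that takes the sequence and the exon lines, which makes it easy to test on small made-up data before running it on the real files. This is a sketch: extract_exons is a hypothetical name, the sample sequence and positions are invented, and stop is treated as exclusive, matching a Python slice.

```python
def extract_exons(dna, exon_lines):
    segments = []
    for line in exon_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        start, stop = (int(x) for x in line.split(","))
        segments.append(dna[start:stop])
    # join() builds the result once instead of growing a string in the loop
    return "".join(segments)


# Made-up example data
dna = "AAAACCCCGGGGTTTT"
exon_lines = ["0,4\n", "8,12\n", "\n"]

print(extract_exons(dna, exon_lines))
```

With the real files, the same function could be called with the sequence read from genomic_dna.txt and the open file handle for exons.txt.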

Writing our own functions

Exercises and Solutions

Percentage of amino acid residues, part 1

Write a function that takes two arguments - a protein sequence and an amino acid residue code - and returns the percentage of the protein that the amino acid makes up. Use the following assertions to test your function:

assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0

Solution 1

def amino_acid_pct(seq, residue):
    pct = seq.upper().count(residue.upper()) / len(seq)
    return round(pct * 100)

assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", "Y") == 0

Percentage of amino acid residues, part 2

Modify the function from part one so that it accepts a list of amino acid residues rather than a single one. If no list is given, the function should return the percentage of hydrophobic amino acid residues (A, I, L, M, F, W, Y and V). Your function should pass the following assertions:

assert my_function("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['M', 'L']) == 55
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert my_function("MSRSLLLRFLLFLLLLPPLP") == 65

Solution 1

def amino_acid_pct(seq, residues=None):
    # Default to the hydrophobic residues when no list is given
    if residues is None:
        residues = ['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']

    seq = seq.upper()
    count = 0
    for r in residues:
        count += seq.count(r.upper())

    return round((count / len(seq)) * 100)

assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", ['M', 'L']) == 55
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert amino_acid_pct("MSRSLLLRFLLFLLLLPPLP") == 65
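
A related approach, sketched here as an alternative rather than as the expected solution, counts every residue in a single pass with collections.Counter and then looks up whichever residues are of interest. The function name residue_percentages is made up for illustration.

```python
from collections import Counter


def residue_percentages(seq):
    # Count each residue once, then convert the counts to percentages
    counts = Counter(seq.upper())
    return {res: 100 * n / len(seq) for res, n in counts.items()}


pcts = residue_percentages("MSRSLLLRFLLFLLLLPPLP")

# Residues absent from the sequence are missing from the dict, hence .get(r, 0)
hydrophobic = "AILMFWYV"
print(round(sum(pcts.get(r, 0) for r in hydrophobic)))
```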

Happy Python Hacking!

import sys

hw3due = "2021-03-02 20:00:00"

print(f"Homework 3 due: {hw3due}")
print("Goodbye!")

sys.exit(0)  # exit status 0 signals success; a non-zero status signals an error