Functions, STL, and Data Structures

BCH 519
Spring 2017

Andrew E. Bruno
aebruno2@buffalo.edu

Topics Covered

  • Standard Library and Built-in Functions
  • User defined Functions
  • Lists and Dictionaries
  • Exercises and Solutions

The Python Standard Library and Built-in Functions

What is a function?

Functions

Python has many built-in functions

print     prints objects to stdout
len       returns the length (number of items) of an object, such as a string or list
reversed  returns a reverse iterator (e.g. to reverse a string or list)
float     converts a string or number to a floating point number
str       converts an object to a string
range     returns a sequence of integers
min       returns the smallest item in an iterable
max       returns the largest item in an iterable
abs       returns the absolute value of a number
http://docs.python.org/2/library/functions.html
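
A minimal sketch of the built-in functions listed above. The variable names and sample values are illustrative only, and print is written with parentheses so the sketch runs under both Python 2 and Python 3.

```python
# Sample data (illustrative values, not from the exercises)
seq = "ATGTAATCGGGTAC"
nums = [3, -7, 12, 5]

seq_len = len(seq)                 # number of characters in the string
rev_seq = "".join(reversed(seq))   # reversed() gives an iterator, so join it back into a string
as_float = float("3.5")            # string -> floating point number
as_text = str(42)                  # integer -> string
first_five = list(range(5))        # the integers 0, 1, 2, 3, 4
smallest = min(nums)               # smallest item in the list
largest = max(nums)                # largest item in the list
distance = abs(-7)                 # absolute value

print("Length: {}".format(seq_len))
print("Reversed: {}".format(rev_seq))
print("Min/Max: {} {}".format(smallest, largest))
```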

What is a module?

  • A module is a collection of related functions/code
  • Modules are "imported" using the import statement
  • Python comes with an extensive collection of modules called the Python standard library
  • Modules allow for the reuse and organization of code
  • You can also create your own custom modules and functions

The Python Standard Library

  • Python's standard library is very extensive
  • Includes modules that provide standardized solutions for many problems that occur in everyday programming
  • http://docs.python.org/2/library/index.html
  • Examples [module].[function]:
    string.format   format a string
    string.lower    lowercase a string
    string.upper    uppercase a string
    string.strip    remove leading and trailing characters
    random.randint  return a random integer
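
The examples above can be sketched as follows. The string functions are called here as methods on a string value, and the input string and the fixed seed are assumptions chosen for illustration.

```python
import random

raw = "  atgTAAtcg \n"          # illustrative input with stray whitespace
clean = raw.strip()             # remove leading and trailing whitespace
upper = clean.upper()           # uppercase copy
lower = clean.lower()           # lowercase copy
label = "{} has {} bases".format(upper, len(upper))

random.seed(1)                  # fixed seed so repeated runs give the same draw
roll = random.randint(1, 6)     # 1 <= roll <= 6, both ends included

print(label)
print("Roll: {}".format(roll))
```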

Exercise 1: Modules and functions

import random

seq = "ATGTAATCGGGTAC"
seq_len = len(seq)      # number of characters in string
seq_lower = seq.lower() # lower case all characters

print "Length: {}".format(seq_len)
print "Lower case: {}".format(seq_lower)

# Generate 5 random numbers between 0 and 100
for i in range(0, 5):
    rand_int = random.randint(0, 100)  # 0 <= rand_int <= 100
    print "{}: {}".format(i, rand_int)

Run in the terminal by typing:

$ python ex1.py
Length: 14
Lower case: atgtaatcgggtac
0: 23
1: 84
2: 69
3: 55
4: 45

Exercise 1: What we learned

  1. functions typically take an argument list and return one or more values
  2. Example: random.randint(A, B)
  3. len() returns the length in characters of a string
  4. range(0, 5) returns a sequence of five ints: 0, 1, 2, 3, 4
  5. import random imports the "random" module from the Python standard library
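
A small sketch of the range() and random.randint() behaviour summarized above. It assumes nothing beyond the standard library; the sample size of 1000 is arbitrary.

```python
import random

first = list(range(0, 5))        # start at 0, stop before 5
short = list(range(5))           # the start defaults to 0
evens = list(range(0, 10, 2))    # an optional third argument sets the step

# randint includes both end points, unlike range
draws = [random.randint(0, 100) for i in range(1000)]

print(first)
print(evens)
print("Smallest draw: {}, largest draw: {}".format(min(draws), max(draws)))
```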

Defining Functions

Functions allow us to reuse blocks of code

Exercise 2: Functions

def power(num, pw):
    result = num ** pw
    return result

sq = power(8, 2)
print "8 squared = {}".format(sq)

cu = power(3, 3)
print "3 cubed = {}".format(cu)

Run in the terminal by typing:

$ python ex2.py
8 squared = 64
3 cubed = 27

Exercise 2: What we learned

  1. def keyword is used to define your own functions
  2. code under the def line must be indented consistently (4 spaces by convention)
  3. functions can be passed an argument list
  4. values can be returned from a function using the return keyword
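
As a further sketch, a function can return more than one value by separating them with commas. The function name summarize is hypothetical and chosen only for this example.

```python
def summarize(seq):
    """Return the length of seq and its lowercase form."""
    length = len(seq)
    lowered = seq.lower()
    return length, lowered    # two values, returned together as a tuple

# Unpack the two returned values into separate variables
seq_len, seq_lower = summarize("ATGTAATCGGGTAC")
print("Length: {}".format(seq_len))
print("Lower case: {}".format(seq_lower))
```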

Data Structures: Lists and Dictionaries

Lists

  • Used to store a list of values
  • Defined using [] (parentheses () create a tuple, a similar sequence that cannot be modified)
  • Indices are 0 based
  • The empty list: lst = []
  • Simple list with one item:
    • lst = [32]
    • lst = ["DNA Replication"]
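
A minimal sketch of creating and changing a list, with a tuple shown for contrast. The gene names are placeholder values used only for illustration.

```python
genes = []                  # the empty list
genes.append("BRCA1")       # append adds an item to the end
genes.append("TP53")
genes[0] = "BRCA2"          # items can be replaced by index

pair = ("start", "stop")    # parentheses create a tuple, which cannot be changed

is_list = isinstance(genes, list)
is_tuple = isinstance(pair, tuple)

print(genes)
print("Items in list: {}".format(len(genes)))
```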

Exercise 3: Accessing List Values

nums = [10, 20, 30, 40, 50, 60]

first_item = nums[0]
second_item = nums[1]
last_item = nums[-1]

# string.join() takes in a list of strings
print ",".join([str(first_item), str(second_item), str(last_item)])

# Extract a "slice". start at index 1 up to index 4 (not including)
#    0   1   2   3   4   5
#  [10, 20, 30, 40, 50, 60]
#        *   *   * 
sub_list = nums[1:4]
for n in sub_list:
    print n

Run in the terminal by typing:

$ python ex3.py
10,20,60
20
30
40

Exercise 3: What we learned

  1. lists are 0 based
    index:        0   1   2   3   4   5 
    list:       [10, 20, 30, 40, 50, 60]    len() = 6
    
  2. access items of a list using their index: nums[3]
  3. negative indices count from the end of the list: nums[-1] is the last item, nums[-2] the second to last
  4. extract a sublist using the slice notation: nums[1:4]
  5. Convert integers to strings using the str() function
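
A few more slice forms, sketched on the same list. Leaving out the start or the end of a slice means "from the beginning" or "to the end".

```python
nums = [10, 20, 30, 40, 50, 60]

first_three = nums[:3]     # indices 0, 1, 2
from_third = nums[2:]      # index 2 to the end
last_two = nums[-2:]       # the final two items
every_other = nums[::2]    # every second item, starting at index 0

print(first_three)
print(from_third)
print(last_two)
print(every_other)
```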

Dictionaries

  • Used to store a collection of key/value pairs
  • Defined using {}
  • The empty dict: dct = {}
  • Example:
    • dct = {'key': 'value'}

Exercise 4: Accessing Dict items

fruit = {
    'oranges': 10,
    'grapes': 20,
}

# Add new key
fruit['pears'] = 30

keys = fruit.keys()      # fetch all keys as a list
values = fruit.values()  # fetch all values as a list
print ','.join(keys)
total_grapes = fruit['grapes']
print "Total grapes: {}".format(total_grapes)

for key in fruit:
    value = fruit[key]
    print "{} = {}".format(key, value)

Run in the terminal by typing:

$ python ex4.py
pears,grapes,oranges
Total grapes: 20
pears = 30
grapes = 20
oranges = 10

Exercise 4: What we learned

  1. dictionaries store a mapping of key/value pairs
  2. Access values using key: fruit['grapes']
  3. Fetch the list of all keys using: fruit.keys()
  4. Iterate over keys: for key in fruit
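
A short sketch of a few related dictionary operations, assuming the same fruit data. Looking up a missing key with fruit['kiwi'] raises a KeyError, so the sketch tests for the key with in and uses get() with a default value.

```python
fruit = {'oranges': 10, 'grapes': 20}
fruit['pears'] = 30

has_kiwi = 'kiwi' in fruit            # test whether a key exists
kiwi_count = fruit.get('kiwi', 0)     # returns 0 instead of raising KeyError
total = sum(fruit.values())           # add up all the values

# Sorting the keys gives the same output order on every run
for key in sorted(fruit):
    print("{} = {}".format(key, fruit[key]))
print("Total: {}".format(total))
```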

Chapter 2

Exercises and Solutions

Printing and manipulating text

Exercises

Calculating AT content

  • Here's a short DNA sequence:
    ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT
    Write a program that will print out the AT content of this DNA sequence.

Solution 1

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print seq.count('A') + seq.count('T')

Solution 2

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
at_counter = 0
for base in seq:
    if base.lower() == 'a' or base.lower() == 't':
        at_counter = at_counter + 1

print at_counter

Solution 3

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print sum(1 for b in seq if b == 'A' or b == 'T')
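
The solutions above print the number of A and T bases. If AT content is read as a proportion of the whole sequence, which is an assumption about the intent of the exercise, the count can be divided by the sequence length:

```python
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

at_count = seq.count('A') + seq.count('T')
# float() forces true division under Python 2 as well as Python 3
at_content = float(at_count) / len(seq)

print("AT count: {}".format(at_count))
print("AT content: {:.3f}".format(at_content))
```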

Complementing DNA

  • Here's a short DNA sequence:
    ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT
    Write a program that will print the complement of this sequence.

Solution 1

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
comp = ''
for base in seq:
    if base.upper() == 'A':
        comp = comp + 'T'
    elif base.upper() == 'T':
        comp = comp + 'A'
    elif base.upper() == 'C':
        comp = comp + 'G'
    elif base.upper() == 'G':
        comp = comp + 'C'

print comp

Solution 2

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
base_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
comp = ''
for base in seq:
    comp = comp + base_map[base]

print comp

# or shorthand version
print ''.join(base_map[base] for base in seq)
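
Solution 2 raises a KeyError if the sequence contains a character that is not in base_map, such as a lowercase base or an N. One possible way to handle that, sketched here with a hypothetical complement function that writes N for anything it does not recognize:

```python
base_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement(seq):
    """Complement seq; characters missing from base_map become 'N'."""
    return ''.join(base_map.get(base.upper(), 'N') for base in seq)

print(complement('ACTGATCG'))
print(complement('acgtnx'))
```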

Restriction fragment lengths

  • Here's a short DNA sequence:
    ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT
    The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

Solution 1

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
index = seq.find('GAATTC')
print 'Frag 1 size: {}'.format(len(seq[0:index+1]))
print 'Frag 2 size: {}'.format(len(seq[index+1:]))

Solution 2

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
frag1_size = seq.find('GAATTC') + 1
print frag1_size
print len(seq) - frag1_size 
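
Note that find() returns -1 when the motif is absent, in which case the solutions above would report misleading sizes. A sketch that checks for this, using a hypothetical function name and handling only the first site in the sequence:

```python
def ecori_fragments(seq):
    """Return the two fragment sizes, or None if there is no EcoRI site."""
    index = seq.find('GAATTC')
    if index == -1:               # find() returns -1 when the motif is absent
        return None
    cut = index + 1               # EcoRI cuts after the G of GAATTC
    return cut, len(seq) - cut

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
sizes = ecori_fragments(seq)
no_site = ecori_fragments('ACTGATCG')

print("Fragment sizes: {}".format(sizes))
print("No site: {}".format(no_site))
```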

Splicing out introns, part one

  • Here's a short section of genomic DNA:
    ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTC
    GATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGA
    TCGATATCGATGCATCGACTACTAT
    It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence.

Solution 1

seq  = 'ATCGATCGATCGATCGACTGACTAGT'
seq += 'CATAGCTATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGC'
seq += 'TATCATCGATCGATATCGATGCATCGACTACTAT'

exon1 = seq[0:63]   # characters 1 to 63 (the slice end is not included)
exon2 = seq[90:]    # character 91 to the end
print exon1 + exon2
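
The exercise gives positions counted from 1 with both ends included, while slices count from 0 and leave out the end index. One way to keep the conversion in a single place is a small helper; region is a hypothetical name used only in this sketch.

```python
seq = ('ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTC'
       'GATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGA'
       'TCGATATCGATGCATCGACTACTAT')

def region(seq, first, last):
    """Return characters first..last, counted from 1, both ends included."""
    return seq[first - 1:last]

exon1 = region(seq, 1, 63)
intron = region(seq, 64, 90)
exon2 = region(seq, 91, len(seq))

print("Exon 1: {} bases".format(len(exon1)))
print("Intron: {} bases".format(len(intron)))
print("Exon 2: {} bases".format(len(exon2)))
```

The three regions together cover every base exactly once, which is a quick check that no position was dropped or counted twice.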

Splicing out introns, part two

  • Using the data from part one, write a program that will calculate what percentage of the DNA sequence is coding.

Solution 1

from __future__ import division
seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTA'
seq += 'TGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTAT'
seq += 'CATCGATCGATATCGATGCATCGACTACTAT'
# coding bases (characters 1-63 and 91 to the end) as a percentage of the total
print 100 * ( len(seq[0:63]) + len(seq[90:]) ) / len(seq)

Splicing out introns, part three

  • Using the data from part one, write a program that will print out the original genomic DNA sequence with coding bases in uppercase and non-coding bases in lowercase.

Solution 1

seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGC'
seq += 'TATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTATCA'
seq += 'TCGATCGATATCGATGCATCGACTACTAT'
print seq[0:63].upper() + seq[63:90].lower() + seq[90:].upper()

Homework #2

  • Due: 2017-02-21 09:00:00