Functions, STL, and Data Structures

BCH 519
Spring 2019

Andrew E. Bruno
aebruno2@buffalo.edu

Topics Covered

  • Standard Library and Built-in Functions
  • User-defined Functions
  • Lists and Dictionaries
  • Exercises and Solutions

The Python Standard Library and Built-in Functions

What is a function?

Python has many built-in functions

print prints objects to stdout
len returns the length (number of items) of an object
reversed returns a reverse iterator (e.g. to reverse a string)
float converts a string or number to a floating point number
str converts an object to a string
range returns a sequence of integers
abs returns the absolute value of a number

https://docs.python.org/3/library/functions.html
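
A few of the built-ins listed above in action. This is a minimal sketch; the variable names and values are made up for illustration.

```python
seq = "ATGC"

rev = "".join(reversed(seq))  # reversed() gives an iterator; join it back into a string
ratio = float("0.25")         # string -> floating point number
label = str(42)               # integer -> string
positions = list(range(3))    # range(3) yields 0, 1, 2
distance = abs(-7)            # absolute value

print(rev, ratio, label, positions, distance)
```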

What is a module?

  • A module is a collection of related functions/code
  • Modules are “imported” using the import statement
  • Python comes with an extensive collection of modules called the Python standard library
  • Modules allow for the reuse and organization of code
  • You can also create your own custom modules and functions
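
The two most common forms of the import statement can be sketched as follows. The math and random modules are part of the standard library; the variable names are only illustrative.

```python
import math                  # import a whole module, then use module.function
from random import randint   # import a single name directly

root = math.sqrt(16)         # called through the module name
roll = randint(1, 6)         # called without a prefix; 1 <= roll <= 6

print(root, roll)
```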

The Python Standard Library

  • Python’s standard library is very extensive
  • Includes modules that provide standardized solutions for many problems that occur in everyday programming
  • https://docs.python.org/3/library/index.html
  • Examples, written as [module or type].[function]:
str.format format a string
str.upper uppercase a string
str.strip remove leading and trailing whitespace (or given characters)
random.randint return a random integer
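
A short sketch of the examples above. The string functions are called as methods on a string value, while randint is called through the random module. The input string here is invented for illustration.

```python
import random

raw = "  atgc\n"
clean = raw.strip()                       # remove leading/trailing whitespace
upper = clean.upper()                     # uppercase every character
message = "Sequence: {}".format(upper)    # fill the {} placeholder
pick = random.randint(0, len(upper) - 1)  # random valid index into the string

print(message)
print("Random base: {}".format(upper[pick]))
```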

Exercise 1: Modules and functions

import random

seq = "ATGTAATCGGGTAC"
seq_len = len(seq)      # number of characters in string
seq_lower = seq.lower() # lower case all characters

print("Length: {}".format(seq_len))
print("Lower case: {}".format(seq_lower))

# Generate 5 random numbers between 0 and 100
for i in range(0, 5):
    rand_int = random.randint(0, 100)  # 0 <= rand_int <= 100
    print("{}: {}".format(i, rand_int))

Exercise 1: Output

Then run in the terminal by typing:

$ python ex1.py
Length: 14
Lower case: atgtaatcgggtac
0: 23
1: 84
2: 69
3: 55
4: 45

Exercise 1: What we learned

  • functions typically take an argument list and return one or more values
  • Example: random.randint(A, B)
  • len() returns the length in characters of a string
  • range(0, 5) returns a sequence of five ints: 0, 1, 2, 3, 4
  • import random imports the “random” module from the Python standard library
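
range() also accepts a single argument or a step. A small sketch, wrapping each range in list() so the values can be printed:

```python
first_five = list(range(5))    # same as range(0, 5)
middle = list(range(2, 5))     # start is included, stop is excluded
evens = list(range(0, 10, 2))  # third argument is the step

print(first_five)  # [0, 1, 2, 3, 4]
print(middle)      # [2, 3, 4]
print(evens)       # [0, 2, 4, 6, 8]
```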

Defining Functions

Functions allow us to reuse blocks of code

Exercise 2: Functions

def power(num, pw):
    result = num ** pw
    return result

sq = power(8, 2)
print("8 squared = {}".format(sq))

cu = power(3, 3)
print("3 cubed = {}".format(cu))

Exercise 2: Output

Then run in the terminal by typing:

$ python ex2.py
8 squared = 64
3 cubed = 27

Exercise 2: What we learned

  • def keyword is used to define your own functions
  • code under the def line must be indented consistently; 4 spaces is the convention
  • functions can be passed an argument list
  • values can be returned from a function using the return keyword
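
Two variations on the points above, as a hedged sketch: a default argument value and a function that returns two values at once. The default of 2 and the min_max helper are assumptions added for illustration, not part of Exercise 2.

```python
def power(num, pw=2):
    """Raise num to the power pw; pw defaults to 2 (an assumed default)."""
    return num ** pw


def min_max(values):
    """Return the smallest and largest value; Python packs them into a tuple."""
    return min(values), max(values)


squared = power(8)            # uses the default pw=2
cubed = power(3, pw=3)        # arguments can be passed by name
low, high = min_max([4, 1, 9])  # unpack the two returned values

print(squared, cubed, low, high)
```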

Data Structures

Lists and Dictionaries

Lists

  • Used to store a list of values
  • Defined using [] (parentheses () define a tuple, an immutable sequence)
  • Indices are 0 based
  • The empty list: lst = []
  • Simple list with one item:
    • lst = [32]
    • lst = ["DNA Replication"]
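
A minimal sketch of building and changing a list. The gene names are example data only.

```python
genes = []                 # start with the empty list
genes.append("BRCA1")      # append() adds an item to the end
genes.append("TP53")

count = len(genes)         # number of items in the list
has_tp53 = "TP53" in genes # membership test with the in operator
genes[0] = "BRCA2"         # items can be replaced by index

print(count, has_tp53, genes)
```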

Exercise 3: Accessing List Values

nums = [10, 20, 30, 40, 50, 60]

first = nums[0]
second = nums[1]
last = nums[-1]

# str.join() takes a list of strings
print(",".join([str(first), str(second), str(last)]))

# Extract a "slice": start at index 1, stop before index 4
#    0   1   2   3   4   5
#  [10, 20, 30, 40, 50, 60]
#        *   *   *
sub_list = nums[1:4]
for n in sub_list:
    print(n)

Exercise 3: Output

Then run in the terminal by typing:

$ python ex3.py
10,20,60
20
30
40

Exercise 3: What we learned

  • access items of a list using their index: nums[3]
  • a negative index counts from the end of the list: last item nums[-1], second-to-last item nums[-2]
  • extract a sublist using the slice notation: nums[1:4]
  • Convert integers to strings using the str() function
  • lists are 0 based
index:        0   1   2   3   4   5 
list:       [10, 20, 30, 40, 50, 60]    len() = 6
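
Slice bounds can be omitted or negative, and a step can be added. A short sketch using the same list:

```python
nums = [10, 20, 30, 40, 50, 60]

head = nums[:2]        # omitted start means "from the beginning"
tail = nums[4:]        # omitted stop means "through the end"
last_two = nums[-2:]   # negative start counts from the end
every_other = nums[::2]  # step of 2

print(head)         # [10, 20]
print(tail)         # [50, 60]
print(last_two)     # [50, 60]
print(every_other)  # [10, 30, 50]
```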

Dictionaries

  • Used to store a collection of key/value pairs
  • Defined using {}
  • The empty dict: dct = {}
  • Example:
    • dct = {'key': 'value'}
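
A small sketch of common dictionary operations. The codon table is a two-entry example, not a complete table.

```python
codon_table = {'ATG': 'Met', 'TGG': 'Trp'}  # example data only

amino = codon_table['ATG']               # look up a value by key
has_stop = 'TAA' in codon_table          # test whether a key exists
fallback = codon_table.get('TAA', '?')   # get() returns a default instead of raising KeyError
size = len(codon_table)                  # number of key/value pairs

print(amino, has_stop, fallback, size)
```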

Exercise 4: Accessing Dict items

fruit = {
    'oranges': 10,
    'grapes': 20,
}
# Add new key
fruit['pears'] = 30

keys = fruit.keys()      # fetch all keys (a view object in Python 3)
values = fruit.values()  # fetch all values (a view object in Python 3)
print(','.join(keys))
total_grapes = fruit['grapes']
print("Total grapes: {}".format(total_grapes))

for key in fruit:
    value = fruit[key]
    print("{} = {}".format(key, value))

Exercise 4: Output

Then run in the terminal by typing:

$ python ex4.py
oranges,grapes,pears
Total grapes: 20
oranges = 10
grapes = 20
pears = 30

Exercise 4: What we learned

  • dictionaries store a mapping of key/value pairs
  • Access values using key: fruit['grapes']
  • Fetch all keys using: fruit.keys()
  • Iterate over keys: for key in fruit
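
Two related idioms, sketched with the same fruit data: items() yields key and value together, and sorted() gives the keys in alphabetical order.

```python
fruit = {'oranges': 10, 'grapes': 20, 'pears': 30}

# items() yields (key, value) pairs, so no second lookup is needed
for key, value in fruit.items():
    print("{} = {}".format(key, value))

ordered_keys = sorted(fruit)  # sorted() over a dict returns a sorted list of its keys
total = sum(fruit.values())   # add up all the values

print(ordered_keys, total)
```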

Printing and manipulating text

Exercises and Solutions

Calculating AT content

Here’s a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the AT content of this DNA sequence.

Solution 1

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print(seq.count('A') + seq.count('T'))

Solution 2

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
at_counter = 0
for base in seq:
    if base.lower() == 'a' or base.lower() == 't':
        at_counter = at_counter + 1

print(at_counter)

Solution 3

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print(sum(1 for b in seq if b == 'A' or b == 'T'))
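
The solutions above print the number of A and T bases. If AT content is taken to mean the proportion of the sequence that is A or T (an assumption about the intended answer), the count can be divided by the sequence length:

```python
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

at_count = seq.count('A') + seq.count('T')
at_content = at_count / len(seq)  # true division in Python 3

print("AT count: {}".format(at_count))
print("AT content: {:.3f}".format(at_content))
```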

Complementing DNA

Here’s a short DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print the complement of this sequence.

Solution 1

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
comp = ''
for base in seq:
    if base.upper() == 'A':
        comp = comp + 'T'
    elif base.upper() == 'T':
        comp = comp + 'A'
    elif base.upper() == 'C':
        comp = comp + 'G'
    elif base.upper() == 'G':
        comp = comp + 'C'

print(comp)

Solution 2

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
base_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
comp = ''
for base in seq:
    comp = comp + base_map[base]

print(comp)

# or shorthand version
print(''.join(base_map[base] for base in seq))
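
An alternative sketch, not one of the original solutions, uses the built-in str.maketrans and str.translate methods to swap all four bases in one pass. Reversing the result with a slice gives the reverse complement.

```python
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

# Map each base in the first string to the base at the same position in the second
table = str.maketrans('ATGC', 'TACG')
comp = seq.translate(table)
rev_comp = comp[::-1]  # a step of -1 reverses the string

print(comp)
print(rev_comp)
```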

Restriction fragment lengths

Here’s a short DNA sequence:

ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT

The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

Solution 1

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
index = seq.find('GAATTC')
print('Frag 1 size: {}'.format(len(seq[0:index+1])))
print('Frag 2 size: {}'.format(len(seq[index+1:])))

Solution 2

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
frag1_size = seq.find('GAATTC') + 1
print(frag1_size)
print(len(seq) - frag1_size)
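
Both solutions assume the motif is present; str.find returns -1 when it is not. A hedged sketch of a reusable version, where the function name fragment_lengths and the cut_offset parameter are hypothetical:

```python
def fragment_lengths(seq, motif, cut_offset):
    """Return fragment sizes after cutting at the first occurrence of motif.

    cut_offset is the number of motif bases left of the cut (1 for G*AATTC).
    """
    index = seq.find(motif)
    if index == -1:
        return [len(seq)]  # no recognition site: the sequence stays in one piece
    cut = index + cut_offset
    return [cut, len(seq) - cut]


seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
sizes = fragment_lengths(seq, 'GAATTC', 1)
print(sizes)
```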

Splicing out introns, part one

Here’s a short DNA sequence:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTC
GATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGA
TCGATATCGATGCATCGACTACTAT

It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence.

Solution 1

seq  = 'ATCGATCGATCGATCGACTGACTAGT'
seq += 'CATAGCTATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGC'
seq += 'TATCATCGATCGATATCGATGCATCGACTACTAT'

exon1 = seq[0:63]  # characters 1 through 63 (the end index is exclusive)
exon2 = seq[90:]   # character 91 through the end
print(exon1 + exon2)
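
The exon coordinates in the exercise are 1-based and inclusive, while Python slices are 0-based with an exclusive end. A minimal sketch of the conversion, using a hypothetical helper named extract and a toy sequence:

```python
def extract(seq, start, end):
    """Return seq from 1-based position start through end, inclusive."""
    # Subtract 1 from the start; the exclusive slice end equals the inclusive end
    return seq[start - 1:end]


toy = 'ABCDEFGHIJ'
first_three = extract(toy, 1, 3)        # characters 1..3, same as toy[0:3]
last_three = extract(toy, 8, len(toy))  # characters 8..10, same as toy[7:]

print(first_three, last_three)
```

By the same rule, characters 1 through 63 are seq[0:63] and character 91 onward is seq[90:].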

Splicing out introns, part two

Using the data from part one, write a program that will calculate what percentage of the DNA sequence is coding.

Solution 1

# Only needed for Python 2
from __future__ import division

seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTA'
seq += 'TGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTAT'
seq += 'CATCGATCGATATCGATGCATCGACTACTAT'
coding_len = len(seq[0:63]) + len(seq[90:])
print(coding_len / len(seq) * 100)  # coding bases as a percentage of the sequence

Splicing out introns, part three

Using the data from part one, write a program that will print out the original genomic DNA sequence with coding bases in uppercase and non-coding bases in lowercase.

Solution 1

seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGC'
seq += 'TATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTATCA'
seq += 'TCGATCGATATCGATGCATCGACTACTAT'
print(seq[0:63].upper() + seq[63:90].lower() + seq[90:].upper())
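
One way to check a solution like this, sketched under the assumption that the exons are characters 1 to 63 and 91 to the end as stated in part one: the case-marked string should be as long as the input, and upper-casing it should give back the original sequence.

```python
seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTC'
seq += 'GATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGA'
seq += 'TCGATATCGATGCATCGACTACTAT'

marked = seq[0:63].upper() + seq[63:90].lower() + seq[90:].upper()

# Adjacent slices must share their boundary index, or a base is lost or repeated
assert len(marked) == len(seq)
assert marked.upper() == seq

intron_len = sum(1 for base in marked if base.islower())
print("Intron length: {}".format(intron_len))
```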

Homework #2

Due: 2019-02-19 09:00:00