## Functions, STL, and Data Structures

#### BCH 519 Spring 2018

Andrew E. Bruno
aebruno2@buffalo.edu

### Topics Covered

• Standard Library and Built-in Functions
• User defined Functions
• Lists and Dictionaries
• Exercises and Solutions

## The Python Standard Library and Built-in Functions

### Python has many built-in functions

| Function | Description |
|---|---|
| `print` | prints objects to stdout |
| `len` | returns the length (number of items) of an object, such as a string or list |
| `reversed` | returns a reverse iterator (e.g. to reverse a string or list) |
| `float` | converts a string or number to a floating point number |
| `str` | converts an object to a string |
| `range` | returns a sequence of integers |
| `min` | returns the smallest item in an iterable |
| `max` | returns the largest item in an iterable |
| `abs` | returns the absolute value of a number |

http://docs.python.org/2/library/functions.html
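As a rough illustration of the built-ins listed above, the sketch below applies each of them to a short sequence and a small list. The variable names and values are made up for the example, and it uses `print(...)` with a single argument so that it should run under both Python 2 and Python 3.

```python
seq = "ATGTAATCGGGTAC"                # illustrative sequence
nums = [3, -7, 12, 5]                 # illustrative numbers

seq_len = len(seq)                    # number of characters in the string
rev_seq = "".join(reversed(seq))      # reversed() gives an iterator, so join it back into a string
as_float = float("3.5")               # string -> floating point number
as_text = str(42)                     # number -> string
first_five = list(range(5))           # list() gives a list on both Python 2 and Python 3
smallest = min(nums)
largest = max(nums)
magnitude = abs(smallest)             # absolute value of the smallest number

print("length: {}".format(seq_len))
print("reversed: {}".format(rev_seq))
print("range(5): {}".format(first_five))
print("min/max/abs: {} {} {}".format(smallest, largest, magnitude))
```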

### What is a module?

• A module is a collection of related functions/code
• Modules are "imported" using the `import` statement
• Python comes with an extensive collection of modules called the Python standard library
• Modules allow for the reuse and organization of code
• Can also create your own custom modules and functions
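The two most common forms of the `import` statement can be sketched as follows. The choice of `math.sqrt` and `random.randint` is only for illustration; both are part of the standard library.

```python
import math                    # import the whole module; functions are called as math.<name>
from random import randint     # import a single function by name

root = math.sqrt(16)           # square root, returned as a float
roll = randint(1, 6)           # random integer with 1 <= roll <= 6

print("sqrt(16) = {}".format(root))
print("dice roll = {}".format(roll))
```

A custom module works the same way: saving your own functions in a file such as `seqtools.py` (a hypothetical name) in the same directory would let another script load them with `import seqtools`.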

### The Python Standard Library

• Python's standard library is very extensive
• Includes modules that provide standardized solutions for many problems that occur in everyday programming
• http://docs.python.org/2/library/index.html
• Examples, written as [module].[function]:

| Function | Description |
|---|---|
| `string.format` | format a string |
| `string.lower` | lowercase a string |
| `string.upper` | uppercase a string |
| `string.strip` | remove leading and trailing characters |
| `random.randint` | return a random integer |
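In everyday code the string operations above are usually called as methods on a string object, which is also how the exercises in this section use them. A minimal sketch, with an invented input value:

```python
raw = "  atgTAAtcg\n"          # illustrative input with stray whitespace

clean = raw.strip()            # remove leading and trailing whitespace
upper = clean.upper()          # uppercase every character
lower = clean.lower()          # lowercase every character
label = "{} has {} bases".format(upper, len(upper))

print(label)
```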

### Exercise 1: Modules and functions

``````
import random

seq = "ATGTAATCGGGTAC"
seq_len = len(seq)      # number of characters in string
seq_lower = seq.lower() # lower case all characters

print "Length: {}".format(seq_len)
print "Lower case: {}".format(seq_lower)

# Generate 5 random numbers between 0 and 100
for i in range(0, 5):
    rand_int = random.randint(0, 100)  # 0 <= rand_int <= 100
    print "{}: {}".format(i, rand_int)
``````

### Run in the terminal by typing:

``````
$ python ex1.py
Length: 14
Lower case: atgtaatcgggtac
0: 23
1: 84
2: 69
3: 55
4: 45
``````

### Exercise 1: What we learned

1. functions typically take an argument list and return one or more values
2. Example: `random.randint(A, B)`
3. `len()` returns the length in characters of a string
4. `range(0, 5)` returns a sequence of five ints: 0, 1, 2, 3, 4
5. `import random` imports the "random" module from the Python standard library

## Defining Functions

### Exercise 2: Functions

``````
def power(num, pw):
    result = num ** pw
    return result

sq = power(8, 2)
print "8 squared = {}".format(sq)

cu = power(3, 3)
print "3 cubed = {}".format(cu)
``````

### Run in the terminal by typing:

``````
$ python ex2.py
8 squared = 64
3 cubed = 27
``````

### Exercise 2: What we learned

1. `def` keyword is used to define your own functions
2. code under the `def` line must be indented consistently (4 spaces by convention)
3. functions can be passed an argument list
4. values can be returned from a function using the `return` keyword
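Two small extensions of these points, offered as a sketch rather than as part of the exercise: an argument can be given a default value, and a function can return more than one value at once. The `min_max` helper is a made-up example.

```python
def power(num, pw=2):
    """Raise num to the power pw; pw defaults to 2 when it is omitted."""
    return num ** pw

def min_max(values):
    """Return the smallest and largest item as a pair."""
    return min(values), max(values)   # two values come back as a tuple

sq = power(8)                         # uses the default pw=2
cu = power(3, 3)
low, high = min_max([10, 20, 30])     # unpack the returned pair into two names

print("8 squared = {}".format(sq))
print("3 cubed = {}".format(cu))
print("low = {}, high = {}".format(low, high))
```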

## Data Structures: Lists and Dictionaries

### Lists

• Used to store a list of values
• Defined using `[]` (parentheses `()` define a tuple, a sequence that cannot be modified)
• Indices are 0-based
• The empty list: `list = []`
• Simple list with one item:
• `list = [32]`
• `list = ["DNA Replication"]`
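A short sketch of building up a list, assuming nothing beyond the built-in list type. The topic names are illustrative.

```python
topics = []                         # start with the empty list
topics.append("DNA Replication")    # append() adds an item to the end
topics.append("Transcription")

pair = (32, 64)                     # parentheses create a tuple, which cannot be modified

first_topic = topics[0]             # indices start at 0
topic_count = len(topics)

print("{} topics, first is {}".format(topic_count, first_topic))
```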

### Exercise 3: Accessing List Values

``````
nums = [10, 20, 30, 40, 50, 60]

first_item = nums[0]
second_item = nums[1]
last_item = nums[-1]

# the join() string method takes a list of strings
print ",".join([str(first_item), str(second_item), str(last_item)])

# Extract a "slice". start at index 1 up to index 4 (not including)
#    0   1   2   3   4   5
#  [10, 20, 30, 40, 50, 60]
#        *   *   *
sub_list = nums[1:4]
for n in sub_list:
    print n
``````

### Run in the terminal by typing:

``````
$ python ex3.py
10,20,60
20
30
40
``````

### Exercise 3: What we learned

1. lists are 0 based
``````
index:        0   1   2   3   4   5
list:       [10, 20, 30, 40, 50, 60]    len() = 6
``````
2. access items of a list using their index: `nums[3]`
3. a negative index counts from the end of the list: `nums[-1]` is the last item and `nums[-2]` is the second-to-last
4. extract a sublist using the slice notation: `nums[1:4]`
5. Convert integers to strings using the `str()` function
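The slice notation has a few more forms than the one used in the exercise. The following sketch reuses the same `nums` list; the other variable names are illustrative.

```python
nums = [10, 20, 30, 40, 50, 60]

middle = nums[1:4]        # index 1 up to, but not including, index 4
first_two = nums[:2]      # an omitted start means "from the beginning"
last_two = nums[-2:]      # a negative start counts from the end
every_other = nums[::2]   # the optional third value is the step

print(",".join(str(n) for n in middle))
print(",".join(str(n) for n in every_other))
```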

### Dictionaries

• Used to store a collection of key/value pairs
• Defined using `{}`
• The empty dict: `dct = {}`
• Example:
• `dct = {'key': 'value'}`
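A minimal sketch of creating and querying a dictionary, using a small fragment of the codon table as illustrative data:

```python
codon_to_aa = {"ATG": "Met", "TGG": "Trp"}   # illustrative fragment of a codon table

codon_to_aa["TAA"] = "Stop"                  # assigning to a new key adds a pair
has_atg = "ATG" in codon_to_aa               # the `in` operator tests the keys
unknown = codon_to_aa.get("NNN", "?")        # get() returns a default instead of raising KeyError

print("ATG -> {}".format(codon_to_aa["ATG"]))
print("NNN -> {}".format(unknown))
```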

### Exercise 4: Accessing Dict items

``````
fruit = {
    'oranges': 10,
    'grapes': 20,
}

fruit['pears'] = 30

keys = fruit.keys()      # fetch all keys as a list
values = fruit.values()  # fetch all values as a list
print ','.join(keys)
total_grapes = fruit['grapes']
print "Total grapes: {}".format(total_grapes)

for key in fruit:
    value = fruit[key]
    print "{} = {}".format(key, value)
``````

### Run in the terminal by typing:

``````
$ python ex4.py
pears,grapes,oranges
Total grapes: 20
pears = 30
grapes = 20
oranges = 10
``````

### Exercise 4: What we learned

1. dictionaries store a mapping of key/value pairs
2. Access values using key: `fruit['grapes']`
3. Fetch the list of all keys using: `fruit.keys()`
4. Iterate over keys: `for key in fruit`
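Dictionaries are also handy for counting. The sketch below tallies the bases in a sequence and then iterates over the keys in sorted order, since the order of keys in a dictionary should not be relied upon in Python 2. The sequence is the one from Exercise 1.

```python
seq = "ATGTAATCGGGTAC"

counts = {}
for base in seq:
    # get() returns 0 the first time a base is seen
    counts[base] = counts.get(base, 0) + 1

# sorted() gives the keys in alphabetical order for predictable output
for base in sorted(counts):
    print("{} = {}".format(base, counts[base]))
```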

## Printing and manipulating text

### Calculating AT content

• Here's a short DNA sequence:
``ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT``
Write a program that will print out the AT content of this DNA sequence.

### Solution 1

``````
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print seq.count('A') + seq.count('T')
``````

### Solution 2

``````
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
at_counter = 0
for base in seq:
    if base.lower() == 'a' or base.lower() == 't':
        at_counter = at_counter + 1

print at_counter
``````

### Solution 3

``````
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print sum(1 for b in seq if b == 'A' or b == 'T')
``````
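The solutions above print the number of A and T bases. If "AT content" is taken to mean the fraction of the sequence that is A or T, one possible sketch wraps the count in a function. The name `at_content` and the handling of an empty string are assumptions made for this example.

```python
def at_content(seq):
    """Return the fraction of bases in seq that are A or T (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()                        # accept lowercase input as well
    at_count = seq.count('A') + seq.count('T')
    return float(at_count) / len(seq)        # float() avoids integer division on Python 2

seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print("AT content: {:.3f}".format(at_content(seq)))
```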

### Complementing DNA

• Here's a short DNA sequence:
``ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT``
Write a program that will print the complement of this sequence.

### Solution 1

``````
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
comp = ''
for base in seq:
    if base.upper() == 'A':
        comp = comp + 'T'
    elif base.upper() == 'T':
        comp = comp + 'A'
    elif base.upper() == 'C':
        comp = comp + 'G'
    elif base.upper() == 'G':
        comp = comp + 'C'

print comp
``````

### Solution 2

``````
seq = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
base_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
comp = ''
for base in seq:
    comp = comp + base_map[base]

print comp

# or shorthand version
print ''.join(base_map[base] for base in seq)
``````
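The dictionary approach can be packaged as a reusable function. This is a sketch under two assumptions that go beyond the exercise: lowercase input is accepted, and characters missing from the map (such as `N`) are passed through unchanged. The `reverse_complement` helper is an extra added for illustration.

```python
BASE_MAP = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement(seq):
    """Return the complement of seq in uppercase; unknown characters are kept as-is."""
    return ''.join(BASE_MAP.get(base, base) for base in seq.upper())

def reverse_complement(seq):
    """Return the complement of seq read in the opposite direction."""
    return complement(seq)[::-1]     # a step of -1 reverses the string

print(complement('ACTG'))
print(reverse_complement('ACTG'))
```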

### Restriction fragment lengths

• Here's a short DNA sequence:
``ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT``
The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

### Solution 1

``````
seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
index = seq.find('GAATTC')
print 'Frag 1 size: {}'.format(len(seq[0:index+1]))
print 'Frag 2 size: {}'.format(len(seq[index+1:]))
``````

### Solution 2

``````
seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
frag1_size = seq.find('GAATTC') + 1
print frag1_size
print len(seq) - frag1_size
``````
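Both solutions assume the motif is present. `find()` returns -1 when it is not, which would silently give wrong sizes. A hedged sketch of a function that checks for that case follows; the function name, its parameters, and the choice to return `None` are assumptions for this example, and it only handles a single cut site.

```python
def fragment_sizes(seq, motif='GAATTC', cut_offset=1):
    """Return (frag1_size, frag2_size) for one cut, or None if the motif is absent."""
    index = seq.find(motif)        # find() returns -1 when the motif is not present
    if index == -1:
        return None
    cut = index + cut_offset       # number of bases before the cut position
    return cut, len(seq) - cut

seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
sizes = fragment_sizes(seq)
print("Fragment sizes: {}".format(sizes))
```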

### Splicing out introns, part one

• Here's a short section of genomic DNA:
``````
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTC
GATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGA
TCGATATCGATGCATCGACTACTAT
``````
It comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence.

### Solution 1

``````
seq  = 'ATCGATCGATCGATCGACTGACTAGT'
seq += 'CATAGCTATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGC'
seq += 'TATCATCGATCGATATCGATGCATCGACTACTAT'

# the 63rd character has index 62, and a slice end is exclusive
exon1 = seq[0:63]
exon2 = seq[90:]
print exon1 + exon2
``````
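The exon positions in the problem are given as 1-based character positions, while Python slices are 0-based with an exclusive end. The conversion can be sketched with a small helper. The name `exon_slice` and the demo string are invented for this example.

```python
def exon_slice(seq, start, end):
    """Return seq from position start to position end, both 1-based and inclusive."""
    # subtract 1 from the start; the exclusive slice end already matches the inclusive position
    return seq[start - 1:end]

demo = "ABCDEFGHIJ"                          # illustrative 10-character string
first_three = exon_slice(demo, 1, 3)         # characters 1 to 3
tail = exon_slice(demo, 8, len(demo))        # character 8 to the end

print(first_three)
print(tail)
```

Under this convention the first exon (characters 1 to 63) corresponds to `seq[0:63]` and the second exon (character 91 onward) to `seq[90:]`.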

### Splicing out introns, part two

• Using the data from part one, write a program that will calculate what percentage of the DNA sequence is coding.

### Solution 1

``````
from __future__ import division
seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGCTA'
seq += 'TGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTAT'
seq += 'CATCGATCGATATCGATGCATCGACTACTAT'
coding = len(seq[0:63]) + len(seq[90:])  # exon 1 + exon 2
print coding / len(seq) * 100            # multiply by 100 to get a percentage
``````

### Splicing out introns, part three

• Using the data from part one, write a program that will print out the original genomic DNA sequence with coding bases in uppercase and non-coding bases in lowercase.

### Solution 1

``````
seq  = 'ATCGATCGATCGATCGACTGACTAGTCATAGC'
seq += 'TATGCATGTAGCTACTCGATCGATCGATCGA'
seq += 'TCGATCGATCGATCGATCGATCATGCTATCA'
seq += 'TCGATCGATATCGATGCATCGACTACTAT'
# exon 1 is seq[0:63], the intron is seq[63:90], exon 2 is seq[90:]
print seq[0:63].upper() + seq[63:90].lower() + seq[90:].upper()
``````

## Homework #2

• Due: 2018-02-20 09:00:00