Lecture 2 - Comparisons

Comparing sequences (my focus although approach applies also to restriction data, digital scores, continuous characters)
Need 1st to have a sequence. Some of you already do. To handle it and use for searches need editor. Will show some for seq's but can use any good editor.

Homology: Usual assumption of molecular biologists leads to some terminology

Assumption is similarity = homology

But homology ==> related by descent so Fitch emphasizes proper term = similarity

Gene duplications --> problem too

if comparing properly aligned = orthology

i.e., A and A' in species 1; B and B' in species 2; A and B or A' and B' are orthologous

if comparing misaligned = parology

i.e., A and A' in species 1; B and B' in species 2; A and B' or A' and B are parologous

Fitch intended ortho = correct & para = parallel but used Greek route for crazy

Note morphologists have even more distinctions including recent molecular buzzword homeotic

Big issue = how to align: for proteins with known 3° structure like globins, can use 3-D match

e.g., a , b , g and d human globin

a V-LSPADKTNVKAAWGKVGAHAGGYGA...

g GHFTEEDKATITSLWGKVNV--EDAGG...

b VHLTPEEKSAVTALWGKVNV--DEVGG...

d VHLTPEEKTAVNALWGKVNV--DAVGG...

Assign gaps where structure differs (e.g., indel at turn at 20; alpha-helix indel [Mb shows del in alpha at 53/54; but del at 2 for alpha just by maximizing similarity)

but unavailable for many proteins and not applicable (except indirectly to exons) for DNA/RNA

Dot matrix = another approach

Gibbs & McIntyre (1970) introduced

Algorithm fundamentally is put seq1 on Y-axis, seq2 on X; if identical at point (x,y) then put a dot else leave blank

Sometime used as described for proteins but usually filtered

Must be filtered for DNA/RNA - Maizel&Lenk (1981)

Watch demo aligning rat and mouse ß globin genes with dot for 8 21/30. Note:

indels and gaps in intron 2

repeat in intron 1

the demo uses vostorg's DOTMAP

Initial version searched via axes but White et al (1984) group introduced diagonal search

Pustell&Kafatos (1982) modified to dampen short diagonals and show stringencies via letters (A best, B next, etc) or colors

Staden (1982) also fixed protein version to allow for similarities (mutation matrix introduced by Dayhoff) among amino acids

Global or local alignments

Needleman & Wunsch (1970) found a dynamic programming algorithm. 2 matrices: D = distances and T = traceback with sequences A and B and elementsAi and Bj. Each Dij for A1 . . . Ai and B1 . . . Bj will contain a measure of the minimal distance and the info in Tij will make it feasible to generate the path thru D for the alignment. (If this is incomprehensible see the paper or just appreciate that it works.) The algorithm is implemented in ALSEQ.

Scoring for N-W assigns 0 for a match, 1 for a mismatch and a gap weight. See below for more on gap weighting.

Sellers (1974) developed a distinct but related algorithm. Because it treats a misalignment at the end as a gap, I won't even try to review it.

Gotoh (1982) modified N-W to search diagonally rather than along matrix axes.

Issues

N-W-S-G find only an optimal alignment; may be wiser to find all. Also a slightly suboptimal alignment could be biologically more useful than the optimal ones. There are algorithms but not commonly used.

Gap penalty influences outcome. Most programs use a constant value, g, or one proportional to the length, n, of the gap (r*n). Fitch asserts that proper penalty should be g + r*n with g>r.

Fitch & Smith (1983) tested various gap penalties and showed results depend on penalty.

Hashing or Chunking for speed

Wilbur & Lipman (1983) and Lipman & Pearson (1985) FASTA & FASTP make k-tuple look up tables (like Bible concordance), e.g., doublets in DNA = AA, AT, AG, AC, etc for 16 possible. Table would state that AA occurs at say 2, 23, 45, etc. Table for seq1 compared to hash tables for whole data base and highest scores picked.

BLAST on file server for Genbank (I think modified hasher) even faster but less effective.

Blast/Beauty is another url I have used. Others available include:
Another Blast/Beauty launcher
Fastpat

Can also search by E-mail Support for Lecture 2

Evaluating significance

2 seq of 1000 elements --> 10^767 possible alignments with a priori P for a valuable alignment ~ 10^-600

Some assume Poisson distribution

Some assume all changes at all sites equally probable

Some use Monte Carlo methods

All appear to assume normal distribution or something close but may not apply

Values in literature have less absolute meaning than may appear because of such assumptions

Wisest approach may be to use several gap penalties, obtain several 'optimal' and suboptimal alignments and compare their P's with same method

Multiple alignments

N-W could be used in principle but t(ime) ~ (2^L)*n^L makes it too inefficient

Various improvements by Murata et al (1985), Gotoh (1986) and Karlin et al (1983) (1st 2 reducing order for 3 seq; last by hashing and similarity score) but need shortcuts--

CLUSTAL - Higgins & Sharp (1988 & 89) - & CLUSTALV - Higgins (1992) - developed an heuristic algorithm that I think could be given a theoretical basis:

generate pairwise comparison scores (FASTA)

make UPGMA tree (UPGMA later in phylogeny generation)

anneal pairwise per tree with N-W alignment

I've modified by using NJ tree (also later but much better than UPGMA - maybe best) and iteration

The program is available to the class; see CLUSTALV

MULTALIN - Corpet (1988) -algorithm similar but nice interface; also available at MULTALIN

MUSEQAL - Berger & Munson (1991) -based on Murata above but with random, iterative search instead of exhaustive; available at MUSEQAL

ESEE - Cabot & Beckenbach (1989) - the eyeball sequence editor leaves judgment to you; see ESEE

Send mail to Michael Garrick with questions or comments about this web site. Last modified: September 03, 1998

Send mail to Michael Garrick with questions or comments about this web site.
Last modified: September 03, 1998