Comparing sequences (my focus although approach applies also to
restriction data, digital scores, continuous characters)
Need 1st to have a sequence. Some of you already do. To handle it and use for searches need editor. Will show some for seq's but can use any good editor.

Homology: Usual assumption of molecular biologists leads to some terminology
if comparing properly aligned = orthology
  i.e., A and A' in species 1; B and B' in species 2; A and B or A' and B' are orthologous
if comparing misaligned = paralogy
  i.e., A and A' in species 1; B and B' in species 2; A and B' or A' and B are paralogous
Fitch intended ortho = correct & para = parallel but used Greek root for crazy
Note morphologists have even more distinctions including recent molecular buzzword homeotic

Big issue = how to align: for proteins with known 3° structure like globins, can use 3-D match
Assign gaps where structure differs (e.g., indel at turn at 20; alpha-helix indel [Mb shows del in alpha at 53/54; but del at 2 for alpha just by maximizing similarity]), but unavailable for many proteins and not applicable (except indirectly to exons) for DNA/RNA

Dot matrix = another approach
Initial version searched via axes but White et al (1984) group introduced diagonal search
Pustell & Kafatos (1982) modified to dampen short diagonals and show stringencies via letters (A best, B next, etc) or colors
Staden (1982) also fixed protein version to allow for similarities (mutation matrix introduced by Dayhoff) among amino acids

Global or local alignments
Hashing or Chunking for speed
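
A minimal sketch of the hashing idea, in Python (not the code of any program named in these notes): index every k-letter word of one sequence in a lookup table, then count word matches per diagonal so that only the promising diagonals of the dot matrix need a closer look. The names kmer_index and diagonal_hits, the word length k=4 and the toy sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmer_index(seq, k):
    """Map each k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=4):
    """Count shared k-letter words per diagonal (target position - query position)."""
    index = kmer_index(target, k)
    hits = Counter()
    for i in range(len(query) - k + 1):
        # One table lookup replaces a scan of the whole target sequence.
        for j in index.get(query[i:i + k], ()):
            hits[j - i] += 1
    return hits


if __name__ == "__main__":
    # Toy data (hypothetical): the target is the query with two extra letters on each end.
    query = "ACGTACGTTAGC"
    target = "TT" + query + "AA"
    for diagonal, count in diagonal_hits(query, target, k=4).most_common(3):
        print(f"diagonal {diagonal:+d}: {count} shared words")
```

The diagonal with the most shared words is the natural place to start a slower, more careful alignment; a larger k is faster but misses weaker similarities.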
Can also search by E-mail

Support for Lecture 2

Evaluating significance
2 seq of 1000 elements --> 10^767 possible alignments with a priori P for a valuable alignment ~ 10^-600
Some assume Poisson distribution
Some assume all changes at all sites equally probable
Some use Monte Carlo methods
All appear to assume normal distribution or something close but may not apply
Values in literature have less absolute meaning than may appear because of such assumptions
Wisest approach may be to use several gap penalties, obtain several 'optimal' and suboptimal alignments and compare their P's with same method

Multiple alignments
^L)*n^L makes it too inefficient
Various improvements by Murata et al (1985), Gotoh (1986) and Karlin et al (1983) (1st 2 reducing order for 3 seq; last by hashing and similarity score) but need shortcuts:
CLUSTAL - Higgins & Sharp (1988 & 89) - & CLUSTALV - Higgins (1992) - developed a heuristic algorithm that I think could be given a theoretical basis:
  generate pairwise comparison scores (FASTA)
  make UPGMA tree (UPGMA later in phylogeny generation)
  anneal pairwise per tree with N-W alignment
I've modified by using NJ tree (also later but much better than UPGMA - maybe best) and iteration
The program is available to the class; see CLUSTALV
MULTALIN - Corpet (1988) - algorithm similar but nice interface; also available at MULTALIN
MUSEQAL - Berger & Munson (1991) - based on Murata above but with random, iterative search instead of exhaustive; available at MUSEQAL
ESEE - Cabot & Beckenbach (1989) - the eyeball sequence editor leaves judgment to you; see ESEE
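
A hedged sketch, in Python, of the Monte Carlo route to significance mentioned in the notes above: score a global (N-W) alignment, shuffle one sequence many times to build a null distribution of scores, and repeat with several gap penalties using the same method each time. The scoring values (match +1, mismatch -1, linear gap), the function names nw_score and shuffle_z, and the toy sequences are all assumptions, and the z value reported leans on the near-normal assumption that the notes warn may not apply.

```python
import random
import statistics


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z(a, b, gap=-2, trials=200, seed=0):
    """Return (observed score, z) where z compares the score to shuffled versions of b."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    observed = nw_score(a, b, gap=gap)
    letters = list(b)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        null_scores.append(nw_score(a, "".join(letters), gap=gap))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, z


if __name__ == "__main__":
    # Hypothetical toy sequences differing by one substitution and one deletion.
    seq1 = "ACGTTGCAAGCTTAGCGATCGGATCCA"
    seq2 = "ACGTTGCTAGCTTAGCGATGGATCCA"
    for gap in (-1, -2, -4):
        score, z = shuffle_z(seq1, seq2, gap=gap)
        print(f"gap {gap}: score {score}, z {z:.1f}")
```

Comparing the z values across gap penalties, rather than trusting any single one, follows the advice above to judge alternative alignments with the same method.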
Send mail to Michael
Garrick with questions or comments about this web site.