Degree of rate heterogeneity | eta | alpha |
no rate heterogeneity | 0.0 | infinity |
weak rate heterogeneity | < 0.5 | > 1.0 |
strong rate heterogeneity | > 0.5 | < 1.0 |
maximum rate heterogeneity | 1.0 | 0.0 |
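In addition, the total rate heterogeneity (rho) of the "invariable sites + Gamma" model can then be written as
rho = theta + eta - theta * eta
where theta is the fraction of invariable sites and eta is the Gamma rate heterogeneity parameter.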
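The relations between eta, alpha, and rho can be checked numerically. The following is a minimal Python sketch, not part of PUZZLE; the function names are hypothetical, and it assumes eta = 1/(1+alpha) and that the total rate heterogeneity combines the fraction of invariable sites (theta) and eta as rho = theta + eta - theta * eta.

```python
import math


def eta_from_alpha(alpha: float) -> float:
    """Convert the Gamma shape parameter alpha to eta = 1 / (1 + alpha)."""
    if alpha < 0.0:
        raise ValueError("alpha must be non-negative")
    if math.isinf(alpha):
        return 0.0  # alpha = infinity means no rate heterogeneity
    return 1.0 / (1.0 + alpha)


def alpha_from_eta(eta: float) -> float:
    """Invert eta = 1 / (1 + alpha); eta = 0 corresponds to alpha = infinity."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if eta == 0.0:
        return math.inf
    return (1.0 - eta) / eta


def total_heterogeneity(theta: float, eta: float) -> float:
    """Assumed combination rule: rho = theta + eta - theta * eta."""
    return theta + eta - theta * eta


if __name__ == "__main__":
    for alpha in (math.inf, 2.0, 1.0, 0.5, 0.0):
        print(f"alpha = {alpha}: eta = {eta_from_alpha(alpha):.3f}")
    print("rho for theta = 0.2, eta = 0.5:", total_heterogeneity(0.2, 0.5))
```

Written this way, rho equals 1 - (1 - theta) * (1 - eta), so it stays between 0 and 1 whenever theta and eta do.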
All available options can be selected and changed after PUZZLE has read the input file. Depending on the input files, some options are preselected and displayed in a menu ("PHYLIP look and feel"):
GENERAL OPTIONS
 b Type of analysis? Tree reconstruction
 k Tree search procedure? Quartet puzzling
 n Number of puzzling steps? 1000
 u Show unresolved quartets? No
 o Display as outgroup? Gibbon
 a Parameter estimates? Approximate (faster)
 x Parameter estimation uses? Neighbor-joining tree
SUBSTITUTION PROCESS
 d Type of sequence input data? Nucleotides
 h Codon positions selected? Use all positions
 m Model of substitution? HKY (Hasegawa et al. 1985)
 t Transition/transversion parameter? Estimate from data set
 f Nucleotide frequencies? Estimate from data set
RATE HETEROGENEITY
 w Model of rate heterogeneity? Uniform rate

Confirm [y] or change [menu] settings:
By typing the letters shown in the menu you can either change settings or enter new parameters. Some options (for example "m" and "w") can be invoked several times to switch through a number of different settings. The parameters of the models of sequence evolution can be estimated from the data by a variety of procedures based on maximum likelihood. The analysis is started by typing "y" at the input prompt.
The following table lists all PUZZLE options in alphabetical order. Be aware, however,
that not all of them are accessible at the same time:
Option | Description |
a | Approximation option. Determines whether an approximate or the exact likelihood function is used to estimate the parameters of the models of sequence evolution. The approximate likelihood function is sufficient in most cases and is faster. |
b | Type of analysis. Switches between tree reconstruction by maximum likelihood and likelihood mapping. |
c | Number of rate categories for the discrete Gamma distribution used in the analysis of rate heterogeneity. |
d | Data type. Specifies whether nucleotide or amino acid sequences serve as input. Selected automatically by inspection of the input data. |
e | Gamma rate heterogeneity parameter eta. Equals 1/(1+alpha), where alpha is the usual Gamma distribution parameter. For a precise definition of eta please read the section about models of sequence evolution in this manual. |
f | Base frequencies. The maximum likelihood calculation needs the frequency of each nucleotide (amino acid, doublet) as input. PUZZLE estimates these values from the sequence input data. This option allows other values to be specified. |
g | Group sequences in clusters. Defines the clusters of sequences needed for the likelihood mapping analysis. Only available when likelihood mapping is selected ("b"). |
h | Codon positions. Only available if the number of positions in a nucleotide alignment is a multiple of three. Each of the three codon positions can be selected alone, as can the 1st and 2nd positions together. |
i | Fraction of invariable sites. Probability that a site is invariable. This parameter can be estimated from the data by PUZZLE (only if the approximation option for the likelihood function is turned off). Please read the section about models of sequence evolution in this manual for a precise definition of this parameter. |
k | Tree search. Determines how the overall tree is obtained. The topology is either computed with the quartet puzzling algorithm or defined by the user. Maximum likelihood branch lengths will be computed for this tree. Alternatively, only a maximum likelihood distance matrix can be computed (no overall tree). |
m | Model of substitution. The following models are implemented for nucleotides: the Tamura-Nei (TN) model, the Hasegawa et al. (HKY) model, and the Schöniger-von Haeseler (SH) model. The SH model describes the evolution of pairs of dependent nucleotides (pairs are the first and the second nucleotide, the third and the fourth nucleotide, and so on). It allows for specification of the transition-transversion ratio. The originally proposed model (Schöniger and von Haeseler 1994) is obtained by setting the transition-transversion parameter to 0.5. The Jukes-Cantor (1969), the Felsenstein (1981), and the Kimura (1980) models are all special cases of the HKY model. For amino acid sequence data the Dayhoff et al. (Dayhoff) model, the Jones et al. (JTT) model, and the Adachi and Hasegawa (mtREV24) substitution model are implemented in PUZZLE. The mtREV24 model describes the evolution of amino acids encoded on mtDNA. For more information please read the section in this manual about models of sequence evolution. See also option "w" (model of rate heterogeneity). |
n | Number of puzzling steps. Parameter of the quartet puzzling tree search (its meaning is comparable to the number of bootstrap replicates). Generally, the more sequences are used the more puzzling steps are advised, but the default value of 1000 should be adequate for most data sets. |
o | Outgroup. Used only for displaying the unrooted quartet puzzling tree. The default outgroup is the first sequence of the data set. |
p | Constrain the TN model to the F84 model. This option is only available for the Tamura-Nei model. With this option the expected (!) transition-transversion ratio for the F84 model can be entered, and PUZZLE computes the corresponding parameters of the TN model (this depends on the data file!). This makes it possible to compare the results of PUZZLE with those of the PHYLIP maximum likelihood programs, which use the F84 model. |
r | Y/R transition parameter. This option is only available for the TN model. This parameter is the ratio of the rates for pyrimidine transitions and purine transitions. You don't need to specify this parameter as PUZZLE can estimate it from the given data set. For a precise definition please read the section in this manual about models of sequence evolution. |
s | Symmetrize doublet frequencies. This option is only available for the SH model. With this option the doublet frequencies are symmetrized. For example, the frequencies of "AT" and "TA" are set to the average of both frequencies. |
t | Transition/transversion parameter. For nucleotide data only. You don't need to specify this parameter as PUZZLE can estimate it from the data. The precise definition of this parameter is given in the section on models of sequence evolution in this manual. For most data sets it is numerically very similar to the expected transition-transversion ratio. |
u | Show unresolved quartets. During the quartet puzzling tree search PUZZLE counts the number of unresolved quartet trees. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler 1997). If this option is selected you'll get a detailed list of all starlike quartets. Note that for some data sets there may be a lot of unresolved quartets. In this case a list of all unresolved quartets is probably not very useful and also needs a lot of disk space. |
w | Model of rate heterogeneity. PUZZLE provides several different models of rate heterogeneity: uniform rate over all sites (rate homogeneity), Gamma distributed rates, two rates (1 invariable + 1 variable), and a mixed model (1 invariable rate + Gamma distributed rates). All necessary parameters can be estimated by PUZZLE. Note that whenever invariable sites are taken into account the parameter estimation will invoke the "a" option to use an exact likelihood function. For more detailed information please read the section in this manual about models of sequence evolution. See also option "m" (model of substitution). |
x | Selects the methods used in the estimation of the model parameters. Substitution process estimates can be obtained through quartet subsampling as well as by using an overall neighbor-joining tree, while rate heterogeneity parameters can only be computed through the neighbor-joining tree approach. |
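The symmetrization described for option "s" in the table above can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not PUZZLE's own code: the function name is hypothetical, doublet frequencies are assumed to be held in a dictionary keyed by two-letter strings, and a doublet whose mirror image is missing from the dictionary is treated as if the mirror had frequency 0.

```python
def symmetrize_doublets(freqs: dict) -> dict:
    """Set each doublet frequency to the average of itself and its mirror.

    For example, "AT" and "TA" both receive (f(AT) + f(TA)) / 2.
    Doublets such as "AA" are their own mirror and stay unchanged.
    """
    symmetric = {}
    for doublet, freq in freqs.items():
        mirror = doublet[::-1]  # "AT" -> "TA"
        symmetric[doublet] = (freq + freqs.get(mirror, 0.0)) / 2.0
    return symmetric


if __name__ == "__main__":
    example = {"AT": 0.3, "TA": 0.1, "AA": 0.6}
    print(symmetrize_doublets(example))
```

When every mirrored doublet is present in the input, the frequencies still sum to the same total after symmetrization.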
For nucleotide data PUZZLE computes the expected transition/transversion ratio and the expected pyrimidine transition/purine transition ratio corresponding to the selected model. Base frequencies play an important role in the calculation of these ratios.
PUZZLE also uses a chi-square test at the 5% level to check whether the base composition of each sequence is identical to the average base composition of the whole alignment. All sequences with deviating composition are listed in the output file. Ideally, no sequence (possibly except for the outgroup) should have a deviating base composition. Otherwise a basic assumption implicit in the maximum likelihood calculation is violated.
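A composition test of this kind can be sketched in Python as follows. The sketch is not taken from the PUZZLE source. It assumes nucleotide data, counts only the characters A, C, G, and T, takes the average composition over all sequences (including the one being tested), and uses 3 degrees of freedom, for which the 5% critical value of the chi-square distribution is about 7.815. The function and variable names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"
CHI2_CRITICAL_DF3_5PCT = 7.815  # 95% quantile of chi-square, 3 degrees of freedom


def deviating_sequences(alignment: dict) -> list:
    """Return names of sequences whose base composition fails the test."""
    # Per-sequence base counts, ignoring gaps and ambiguity codes.
    counts = {
        name: Counter(base for base in seq.upper() if base in BASES)
        for name, seq in alignment.items()
    }
    # Average composition of the whole alignment.
    total = Counter()
    for per_seq in counts.values():
        total.update(per_seq)
    n_total = sum(total.values())
    average = {base: total[base] / n_total for base in BASES}

    deviating = []
    for name, per_seq in counts.items():
        n = sum(per_seq.values())
        chi2 = sum(
            (per_seq[base] - n * average[base]) ** 2 / (n * average[base])
            for base in BASES
            if average[base] > 0.0
        )
        if chi2 > CHI2_CRITICAL_DF3_5PCT:
            deviating.append(name)
    return deviating


if __name__ == "__main__":
    data = {f"s{i}": "ACGT" * 25 for i in range(9)}
    data["odd"] = "A" * 100
    print(deviating_sequences(data))
```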
A hidden feature of PUZZLE (since version 2.5) is the employment of an advanced weighting scheme of quartets (Strimmer, Goldman, and von Haeseler 1997) in the quartet puzzling tree search.
PUZZLE also computes the average distance between all pairs of sequences (maximum likelihood distances). The average distances can be viewed as a rough measure for the overall sequence divergence.
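As an illustration of this summary statistic, the following Python sketch averages the entries above the diagonal of a symmetric distance matrix. It assumes the maximum likelihood distances are already available as a list of lists; the function name is hypothetical and the numbers in the example are made up for demonstration.

```python
from itertools import combinations


def average_pairwise_distance(dist: list) -> float:
    """Mean distance over all unordered pairs of sequences."""
    pairs = list(combinations(range(len(dist)), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical distances for three sequences.
    matrix = [
        [0.0, 0.1, 0.3],
        [0.1, 0.0, 0.2],
        [0.3, 0.2, 0.0],
    ]
    print(average_pairwise_distance(matrix))
```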
The quartet puzzling (QP) tree search estimates support values for each internal branch. In principle, these values have the same practical meaning as bootstrap values. Indeed, the support estimates given by PUZZLE turn out to be numerically very similar to the corresponding neighbor-joining bootstrap values. This means that branches showing a QP reliability from 90% to 100% are very strongly supported. In principle one can of course also trust branches with lower reliability, but in this case it is advisable to check how well the respective branch does in comparison to other branches in the tree (relative reliability). For a branch with low confidence it is also important to check the alternative groupings that are not included in the QP tree (they are all listed in the outfile!). There should be a significant gap between the lowest reliability value of the QP tree and the most frequent grouping that is not included in the QP tree. For example, if you have a support value of 60% and the most frequent not-included grouping occurs with a frequency of 20%, then the 60% support for the branch is OK.
PUZZLE computes the number and the percentage of completely unresolved maximum likelihood quartets. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler 1997). The percentage of unresolved quartets among all possible quartets is an indicator of the suitability of the data for phylogenetic analysis. A high percentage usually results in a highly multifurcating quartet puzzling tree. If you have only a few unresolved quartets we recommend invoking option "u" to get a list of all these quartets. In a likelihood mapping analysis the percentage of completely unresolved quartets is shown in the central basin of the triangle diagram.
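The percentage can be reproduced by hand, since a data set of n sequences contains n choose 4 possible quartets. The following Python sketch shows the arithmetic; the function name is hypothetical and the count of unresolved quartets is assumed to be known already.

```python
from math import comb


def unresolved_percentage(n_sequences: int, n_unresolved: int) -> float:
    """Percentage of unresolved quartets among all possible quartets."""
    n_quartets = comb(n_sequences, 4)  # number of ways to pick 4 sequences
    if n_quartets == 0:
        raise ValueError("at least four sequences are needed")
    if not 0 <= n_unresolved <= n_quartets:
        raise ValueError("unresolved count out of range")
    return 100.0 * n_unresolved / n_quartets


if __name__ == "__main__":
    # 8 sequences give 70 quartets; 7 unresolved quartets are 10% of them.
    print(unresolved_percentage(8, 7))
```

Because the number of quartets grows roughly with the fourth power of the number of sequences, even a modest percentage can correspond to a very long list for large data sets.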
PUZZLE can estimate both the parameters involved in the models of substitution (TN,
HKY) and in the model of rate variation (Gamma distribution, fraction of invariable sites)
without prior knowledge of an overall tree by a number of different strategies based on
maximum likelihood (Strimmer and von Haeseler, submitted). For all estimated parameters a
corresponding standard error (S.E.) is computed. In most cases the results obtained are
very satisfactory. However, if you have good arguments to choose a different set of
parameters than the values obtained by PUZZLE don't hesitate to use them. If sequences are
extremely similar it is very hard for any algorithm to extract information about the
model from the data. Also, be careful if the estimated parameter values are very close to
the internal upper and lower bounds:
Parameter (Symbol) | Minimal Value | Maximal Value |
Transition/transversion parameter (kappa) | 0.20 | 30.00 |
Y/R transition parameter (tau) | 0.10 | 6.00 |
Fraction of invariable sites (theta) | 0.00 | 0.99 |
Gamma rate heterogeneity parameter (eta) | 0.01 | 0.99 |
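A simple way to act on this advice is to flag estimates that lie close to one of the bounds in the table above. The Python sketch below does this; the bounds are copied from the table, while the function name and the 1% relative margin are arbitrary assumptions chosen for illustration and are not part of PUZZLE.

```python
# Internal bounds as listed in the table: (minimal value, maximal value).
BOUNDS = {
    "kappa": (0.20, 30.00),  # transition/transversion parameter
    "tau": (0.10, 6.00),     # Y/R transition parameter
    "theta": (0.00, 0.99),   # fraction of invariable sites
    "eta": (0.01, 0.99),     # Gamma rate heterogeneity parameter
}


def near_bound(symbol: str, value: float, rel_margin: float = 0.01) -> bool:
    """True if value lies within rel_margin of the allowed range from a bound.

    rel_margin is a hypothetical threshold, expressed as a fraction of the
    distance between the lower and the upper bound.
    """
    lower, upper = BOUNDS[symbol]
    margin = rel_margin * (upper - lower)
    return value - lower <= margin or upper - value <= margin


if __name__ == "__main__":
    for symbol, estimate in [("kappa", 29.9), ("kappa", 2.0), ("eta", 0.01)]:
        print(symbol, estimate, near_bound(symbol, estimate))
```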
Likelihood mapping is a method to analyse the support for internal branches in a tree without having to compute an overall tree. Every internal branch in a completely resolved tree defines up to four clusters of sequences. If one is interested in the relationship between these groups, a likelihood mapping analysis is adequate. Thus, only prior knowledge of the corresponding clusters is necessary. The likelihood mapping diagrams (as contained in various output files generated by PUZZLE) will then elucidate the possible relationships in detail. More about likelihood mapping will be published elsewhere (Strimmer and von Haeseler 1997).
PUZZLE has a built-in limit of 257 sequences per data set in order to avoid overflow of internal integer variables. At least 32767 sites should be possible, depending on the compiler used. Computation time will be the largest constraint even if sufficient computer memory is available. If rate heterogeneity is taken into account, every additional category slows down the overall computation by the amount of time needed for one complete run assuming rate homogeneity.
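The scaling rule in the last sentence can be turned into a rough planning estimate. The Python sketch below reads the rule as "total time is about the number of rate categories times the time of one rate-homogeneous run", which is an interpretation of the text rather than a measured property; the function names and the example timing are hypothetical.

```python
MAX_SEQUENCES = 257  # built-in limit stated in the manual


def within_sequence_limit(n_sequences: int) -> bool:
    """True if the data set does not exceed the built-in sequence limit."""
    return n_sequences <= MAX_SEQUENCES


def estimated_runtime(t_homogeneous: float, n_categories: int) -> float:
    """Rough runtime estimate for a run with n_categories rate categories.

    t_homogeneous is the time of one complete run under rate homogeneity
    (a single category), in any time unit.
    """
    if n_categories < 1:
        raise ValueError("at least one rate category is required")
    return t_homogeneous * n_categories


if __name__ == "__main__":
    # Hypothetical example: a 2-minute homogeneous run, 4 Gamma categories.
    print(estimated_runtime(120.0, 4), "seconds (rough estimate)")
    print(within_sequence_limit(300))
```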
If problems are encountered PUZZLE terminates program execution and returns a plain
text error message. Depending on the severity errors are classified into three groups:
"HALT " errors: | Very severe. You should never ever see one of these messages. If so, please contact the developers! |
"Unable to proceed" errors: | Harmless but annoying. Mostly memory errors (not enough RAM) or problems with the format of the input files. |
Other errors: | Completely uncritical. Occur mostly when options of PUZZLE are being set. |
On a standard machine (1996 UNIX workstation) with 32 to 64 MB RAM, PUZZLE can easily do maximum likelihood tree searches, including estimation of support values, for data sets with 50-100 sequences. More sequences are possible but probably not very useful (star tree!). As likelihood mapping is not memory consuming and is computationally quite fast, it can be applied to large data sets as well.
There are a number of other very useful and widespread programs to reconstruct
phylogenetic relationships and to analyse molecular sequence data that are available free
of charge. Here are the URLs of some web pages that provide links to most of them
(including the PHYLIP, MOLPHY, and PAML maximum likelihood programs):
The maximum likelihood kernel of PUZZLE is an offspring of the program NucML/ProtML version 2.2 by Jun Adachi and Masami Hasegawa (ftp://sunmh.ism.ac.jp/pub/molphy). We thank them for generously allowing us to use the source code of their program. The maze as icon for PUZZLE was suggested by Joe Felsenstein. Hans Zischler reported a problem with the input tree routine of previous versions of PUZZLE and Katja Nieselt-Struwe helped to improve the EPSF code. We thank Michael Schöniger and Matthias Krings for beta testing and José Castresana for making PUZZLE run under VMS. We thank Catherine Letondal and Liz Bailes for helping to correct a problem of PUZZLE 3.0 on DEC Alpha machines. Finally, we would like to thank the European Bioinformatics Institute (EBI) and the Institut Pasteur for kindly distributing the PUZZLE program and the Deutsche Forschungsgemeinschaft (DFG) for financial support.
Adachi, J. and M. Hasegawa. 1996. MOLPHY: programs for molecular phylogenetics, version 2.3. Institute of Statistical Mathematics, Tokyo.
Adachi, J. and M. Hasegawa. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington DC, pp. 345-352.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17: 368-376.
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.
Felsenstein, J. and G.A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93-104.
Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. In: Munro, H. N. (ed.) Mammalian Protein Metabolism, New York: Academic Press, pp. 21-132.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
Tamura, K. 1994. Model selection in the estimation of the number of nucleotide substitutions. Mol. Biol. Evol. 11: 154-157.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673-4680.
Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Schöniger, M. and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3: 240-247.
Strimmer, K. and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13: 964-969.
Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14: 210-211.
Strimmer, K. and A. von Haeseler. 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. PNAS (USA). 94: 6815-6819.
Strimmer, K. and A. von Haeseler. 1997. Parameter estimation for models of sequence evolution. Genetics. Submitted.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39: 306-314.
The PUZZLE program was first distributed in 1995. Since then it has been
continually improved. Here is a list of the most important changes.
3.1 | Much improved user interface to rate heterogeneity (less confusing menu, rearranged outfile, additional out-of-range check). Possibility to read rooted input trees (automatic removal of basal bifurcation). Computation of average distance between all pairs of sequences. Fix of a bug that caused PUZZLE 3.0 to crash on some systems (DEC Alpha). Cosmetic changes in program and documentation. |
3.0 | Rate heterogeneity included in all models of substitution (Gamma distribution plus invariable sites). Likelihood mapping analysis with Postscript output added. Much more sophisticated maximum likelihood parameter estimation for all model parameters including those of rate heterogeneity. Codon positions selectable. Update to mtREV24. New icon. Less verbose runtime messages. HTML documentation. Better internal error classification. More information in outfile (number of constant positions etc.). |
2.5.1 | Fix of a bug (present only in version 2.5) related to computation of the variance of the maximum likelihood branch lengths that caused occasional crashes of PUZZLE on some systems when applied to data sets containing many very similar sequences. Drop of support for non-FPU Macintosh version. Corrections in manual. |
2.5 | Improved QP algorithm (Strimmer, Goldman, and von Haeseler 1997). Bug fixes in ML engine, computation of ML distances and ML branch lengths, optional input of a user tree, F84 model added, estimation of all TN model parameters and corresponding standard errors, CLUSTAL W treefile convention adopted to allow branch lengths and QP support values to be shown simultaneously, display of unresolved quartets, update of mtREV matrix, source code more compatible with some almost-ANSI compilers, more safety checks in the code. |
2.4 | Automatic data type recognition, chi-square-test on base composition, automatic selection of best amino acid model, estimation of transition-transversion parameter, ASCII plot of quartet puzzling tree into the outfile. |
2.3 | More models, many usability improvements, built-in consensus tree routines, more supported systems, bug fixes, no more dependencies of input order. First EBI distributed version. |
2.2 | Optimized internal data structure requiring much less computer memory. Bug fixes. |
2.1 | Bug fixes concerning algorithm and transition/transversion parameter. |
2.0 | Complete revision merging the maximum likelihood and the quartet puzzling routines into one user friendly program. First electronic distribution. |
1.0 | First public release, presented at the 1995 phylogenetic workshop (15-17 June 1995) at the University of Bielefeld, Germany. |