|
Analyzing population structure using Slatkin & Maddison's s
Background
This is a parametric bootstrapping approach, where an explicit model is
used to generate a distribution of expected values of a test statistic. If the observed value of this
test statistic falls outside of the 95% confidence interval of the simulated distribution, the model is
rejected. In this case, the test statistic, s, is counting the number of changes
(in parsimony steps) of a character corresponding to population membership
(Slatkin & Maddison 1989). In the instructions below, we deal with two types
of taxa: genes and populations. Genes are the gene sequences used to generate a gene tree; each gene
sequence usually corresponds to a single sampled individual. Populations are those groupings, determined
by the user, for hierarchically organizing individual gene sequences into groupings based on geography,
morphology, or other information.
Instructions & Example
These instructions are provided with a worked example. Because each model being tested is likely
specific to the question being asked, many parameter values and models of population structure will
be unique to each study. The models and parameter values used in this example are for educational
purposes only, and we encourage investigators to allow reasonable consideration of model paremeters.
Getting Started — You will need a Mesquite file with the following parts:
- Taxa Block corresponding to genes sampled
- Reconstructed gene tree
- One or more Taxa Blocks corresponding to populations — You may have multiple population
taxa blocks corresponding to different models of population structure. For the ease of this example,
we will only deal with a single population taxa block.
- Association
between the gene taxa and population taxa
- Population Tree Block corresponding to model(s) being tested — Instructions for creating these
model trees are provided below.
Calculating observed value of s
The easiest way to do this is to add a Tree Legend to the gene tree. If you do not have a tree window
open showing the reconstructed gene tree, open one using Taxa&Trees > New Tree
Window. Add the legend from the Analysis menu (Analysis > Tree
Legend...), and select "s of Slatkin & Maddison". If you have more than one
association included in the file, you will be asked to choose which association to use for the calculations.
Record the number that appears in the tree legend — this is the observed value of the test statistic.
In our example, we have 20 gene sequences, sampled from 4 populations. The color of the gene sequences
corresponds to the population from which they were sampled (note: taxon name colors are only used for aesthetics
and are not required for this type of analysis. If you would like to color taxon names, you will
need to set up "Groups"
of taxa, which are not the same as associations.). In this example, s is counting the number of changes
(in parsimony steps) of a character corresponding to population membership (as indicated by color):

Counting the most parsimonious number of changes leads to s = 9
|
Setting up models to be tested
Begin by opening a new tree window, corresponding to the population block of taxa. Select
Default Trees as the source of trees. Manipulate this tree to reflect the
model you wish to test. In order to preserve any changes you make to the tree, you must store the tree in a
Tree Block (Tree > Store Tree As...).
In our example, we would like to test two models; the first is a model of population fragmentation, in
which all four populations became isolated from one another, some time in the past, say 100,000
generations ago. Starting with a default symmetrical tree, we use the collapse branch tool
( ) to collapse the branches indicated
in red on the tree to the left, to create the tree on the right.
We can set all the branch lengths to equal 100000 generations from the Tree Menu: Tree
> Alter/Transform Branch Lengths... > Assign All Branch Lengths... and entering 100000 in
the dialog box that appears. Remember to use Tree > Store Tree in order
to retain these changes. Unless you have selected Drawing > Branches Proportional to
Lengths the tree drawing does not change after you've assigned the branch lengths. You can
toggle the branch lengths and scale from the Drawing Menu (Branches Proportional to
Lengths and Show Scale, respectively).
For our alternative model, we will test the hypothesis that Populations 1 & 2 and Populations
3 & 4 were isolated in two refugia 100000 generations ago, and only recently split
(1000 generations ago) from two into four populations. For this model tree, we manipulate the
fragmentation model tree to reflect the alternative model, and store a copy of this tree under another
name (using Store Tree As... instead of Store Tree to avoid overwriting the fragmentation model
tree; storing this model tree in the same tree block as the fragmentation model tree will make
things easier later on). Using the arrow tool ( ) in the population
tree window, we join the branches of populations 1 & 2 and populations 3 & 4 (below, left).
To assign branch lengths to individual branches, use the Adjust Branch Length tool
( ) on each branch, assigning branches
1-4 a value of 1000 generations (recent split) and branches 5 & 6 a value of 99000 (so the age
of the split is 100000 generations ago). Because the branch lengths of branches 1-4 are
relatively short, it may be difficult to visualize them when branches are drawn proportional to
lengths (below, right), so turning off Branches Proportional to Length may ease manipulation.
We will use the two model trees, fragmentation and refugia, to generate a simulated distribution
of the test statistic.
|
Simulating distributions of test statistic
To simulate a distribution of Slatkin & Maddison's s, bring the tree window
of the population tree corresponding to the model you wish to test to the front.
Select Analysis > New Bar & Line Chart for > Trees.
In the "Select Taxa" dialog that opens, select the block of taxa corresponding to
the gene sequences. Choose "Simulated Trees" as the source of trees, and "Coalescence
Contained within Current Tree" as the Tree Simulator (in older versions of Mesquite, you may
have to check the "Show Secondary Choices" box for this tree simulator to be an option).
Set the effective population size when prompted, and choose "s of Slatkin & Maddison"
as the value to calculate for trees (s of Slatkin & Maddison may be a Secondary Choice, so
you'll have to check the appropriate box for it to appear as an option). You will be
prompted to enter the number of trees for the chart, and asked if the current window is
appropriate for simulations (you should check to be sure the tree window corresponding to the
model you wish to test is the one Mesquite is querying you about). If you have multiple
associations in your file, you will be prompted to choose the association for generating
gene trees first, then you will be prompted again to choose the association for
calculating s. These will often be the same, but certain situations may arise
when the association on which the model is based is not the same as the association
for which the statistic is calculated.
Mesquite uses a stepwise process to generate the distribution of the test
statistic. It repeats the following steps N times, where N is the number of trees you
asked Mesquite to use for calculations:
- Simulate a gene tree within population gene tree
- Calculate the value of s for that gene tree and given association
- Record the value of s
- Discard gene tree
After Mesquite has recorded the simulated values, a Chart Window will show a histogram
of the simulated distribution. By clicking on the "Text" tab, you can see the
numbers used to generate the frequency distribution for the test statistic. These numbers
can be copied & pasted into a spreadsheet program to generate graphs outside of Mesquite.
Additionally, in the "Graphics" view, you can ask Mesquite to draw lines on
the distribution corresponding to 95% confidence interval. In the Chart Window,
select Chart > Analysis > Percentiles.... From here,
you can choose upper or lower bounds, as well as the size of the percentiles (e.g. for the
90% confidence interval, enter 0.1 for the Percentile Boundary).
The chart window displaying the distribution of s is dependent on the Tree
Window showing the population tree on which the gene trees were simulated. By default,
the Chart Window is set to re-calculate s if any changes are made to the Tree
Window. Thus, if you make changes to the Tree Window (re-arrange branches, change
branch lengths, or scroll between trees), the chart will re-simulate gene trees and
perform necessary calculations to reflect the distibution of s corresponding
to the tree shown in the Tree Window. The auto-recalculate option can be toggled on/off
from the Chart menu, if desired. However, if you are testing more than one model of
population structure, it is easiest to:
- Generate a distribution for the first model to be tested
- Record the distribution (copy & paste from the "Text" tab
into a spreadsheet program)
- In the Tree Window, scroll to the next model to be tested; the distribution
will be automatically recalculated, based on the new model
- Repeat for each model being tested, recording the distribution of s for each model
If the observed value of s falls outside of the 95% confidence interval for a
particular model, that model, and the associated parameter values (branch lengths, population
size, etc.) are not supported by the data.
We begin by simulating a distribution of s for the fragmentation model.
From the Tree Window showing the fragmentation model population tree, we select
Analysis > New Bar & Line Chart for > Trees,
choosing, as prompted, Genes Taxa Block (taxa to calculate value), Simulated Trees
(source of trees), Coalescence Contained within Current Tree (tree simulator),
10000 (effective population size), s of Slatkin & Maddison (Value to calculate),
1000 (trees for the chart), and check to be certain that Mesquite is using the
fragmentation model tree. The chart window opens, showing the distribution of
simulated values of s.
By clicking on the "Text" tab, we can see the details of the simulations:
Because our observed values of s = 9 was not within the 95% confidence interval
of the simulated distribution, we reject the fragmentation model.
To test the refugia model, we can take advantage of the Chart Window's dependancy
on the Tree Window. Leaving the Chart Window open, we can scroll to the refugia model
tree in the Tree Window.
The Chart Window will be updated with new simulations, based on the refugia model
tree (if it does not, you may need to turn on "Auto-recalculate" from the
Chart menu in the Chart Window). By selecting Chart > Analysis >
Percentiles..., setting the Percentile boundary to 0.05, and checking the
boxes for both upper and lower tails, we can see the 95% confidence interval for this
distribution:
Because our observed value s = 9 falls within the 95% confidence interval,
we do not reject the two refugia model. By clicking on the "Text" tab, the full
details of the distribution can be seen:
From these results, we reject the fragmentation model in favor of the two refugia model.
Further tests of deeper and shallower divergence times can provide confidence interval for
this model, for given parameter values.
|
In closing, it is important to consider parameter values (Ne, divergence times, etc.)
and the explanatory power of the tests. In the example above, if we had used an effective population size
100 times larger (Ne = 1,000,000), the fragmentation model is not rejected
(the observed value of s = 9 falls within the 95% confidence interval of the simulated distribution
of s). We encourage careful consideration of reasonable parameter values in these approaches.
Rather than relying on one or a few point estimates of parameter values, it may be most informative to
explore parameter space, and define that parameter space that would best explain the given pattern of
genetic variation.
|
|