For any questions or suggestions please contact us
Frequently Asked Questions (FAQ)
1. What is the minimum number of sequences required to get reliable results?
2. Does the entire sequence of the protein need to be submitted or will a partial sequence suffice?
3. What are the advantages of using phylogenetic trees?
4. What should I do if I find problems uploading files?
5. Protein Explorer does not run on my computer after a Selecton run is completed. How do I fix it?
6. What do the different fields in the output file termed 'Codon Ka/Ks scores (numerical values)' mean?
7. What is the best way to run Selecton, and how do I perform statistical testing for positive selection?
8. How do I choose which model to use?
9. What tree format does Selecton accept?
10. How are the Ka/Ks confidence intervals computed?
11. What are the parameters of each model?
12. What is the difference between ω and Ka/Ks in the MEC model?
13. What is "job status"? Why do I need one?
1. What is the minimum number of sequences required to get reliable results?
There is no exact answer to this, since the evolutionary distance between the
sequences matters. Nevertheless, as a rule of thumb we recommend a
minimum of 10 homologues, although Selecton will accept as few as
3 sequences. An MSA of fewer than 10 homologues does not usually supply enough sequence information.
On the other hand, if too many distant homologues are used, saturation of the number of
silent mutations (Ks) may occur, thus distorting the Ka/Ks ratio
computations. Again, as a rule of thumb, we recommend using sequences
whose distances are of the same order of magnitude as the distances between
mammalian sequences. For more distant sequences, caution
should be taken when interpreting the results of Selecton.
2. Does the entire sequence of the protein need to be submitted or will a partial sequence suffice?
The server does not require the full sequence of the protein, and
partial sequences may be accepted. However, the user is responsible for
interpreting the reliability of such results, since selective forces
acting on the entire protein will not be properly taken into account
(see "Estimating Ka/Ks" section in OVERVIEW). When a PDB file is used, the query
sequence is extracted from the MSA file and aligned against the sequence
extracted from the ATOM field of the PDB file. If there is a significant
difference (less than 60% homology), Selecton will exit. Otherwise, a warning will be given but the calculation will continue.
3. What are the advantages of using phylogenetic trees?
Phylogenetic trees take into account the distances between different
sequences. For instance, it is not surprising if many sites of a human and a chimpanzee
sequence are identical. However, a high level of identity between a human and a
cucumber sequence is far more surprising. Methods which take into
account the phylogenetic tree (as opposed to methods which only take into
account the multiple sequence alignment) deal better with relations
between sequences, by weighting clusters of related sequences
differently. This clustering process also diminishes the influence of
redundant sequences.
By default, Selecton computes a neighbor-joining tree. However, we
recommend providing Selecton a more precise phylogenetic tree (for
example, a tree computed using maximum likelihood reconstruction methods).
4. What should I do if I find problems uploading sequence files?
- The accepted format for DNA sequence files is fasta:
>seq1
ATGTAT
>seq2
ATGAAT
If you are working with other sequence file formats such as Clustal, Phylip, etc., we suggest
using software such as BioEdit to convert your file to fasta.
- Check your file in a simple text editor (e.g. Notepad on Windows). It is very common that files downloaded from the web contain unnecessary characters. Eliminate them, and save your file as text only.
- We have found incompatibilities between the text formats of PC/Unix and Mac machines. If you are running the Selecton server from a Mac and get repeated error messages, we recommend that you save your file using Word as an "MS-DOS" text file. This format should be compatible with DOS and Unix text files.
- Check that your file format is correct. If you are supplying a
codon-aligned file (this is an example of a codon alignment, and
this is an example of a non-codon alignment), please verify that there are no internal
stop codons, and that the length of every sequence is divisible by 3.
- When uploading a phylogenetic tree: Please make sure that the
names in the phylogenetic tree are precisely the same as the names in
the DNA sequences file. Any mismatch will lead to Selecton crashing.
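The file checks listed above can be run locally before uploading. The following is a minimal sketch, not part of Selecton: the function names are hypothetical, and the stop codons are those of the standard genetic code, which is an assumption about your data. It checks only the two conditions named above (lengths divisible by 3, no internal stop codons).

```python
# Stop codons of the standard genetic code (assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def parse_fasta(text):
    """Return a dict mapping sequence name -> sequence (upper case)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def check_codon_alignment(records):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for name, seq in records.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not divisible by 3")
        # Split into complete codons; a trailing partial codon is ignored here.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        # Every complete codon except the last one counts as internal.
        for index, codon in enumerate(codons[:-1]):
            if codon in STOP_CODONS:
                problems.append(
                    f"{name}: internal stop codon {codon} at codon {index + 1}"
                )
    return problems

example = ">seq1\nATGTAT\n>seq2\nATGAAT\n"
print(check_codon_alignment(parse_fasta(example)))
```

An empty list means neither problem was detected; it does not guarantee that the server will accept the file.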
5. Protein Explorer does not run on my computer after a Selecton run is completed. How do I fix it?
Protein Explorer is compatible with Internet Explorer 5.5/6 and Netscape 4.7x and requires the installation of Chime 2.6 SP4.
For additional information, see "Troubleshooting Chime and Protein Explorer".
If that page does not solve your problem, or your browser is incompatible with Chime, note that the Selecton server also produces a RasMol coloring script, which can be used to view the color-coded results in RasTop.
6. What do the different fields in the output file termed 'Codon
Ka/Ks scores (numerical values)' mean?
Ka/Ks values are reported for each position of the query sequence provided by the user.
- The first column is the amino acid number.
- The second column is the amino acid.
- The third column is the actual Ka/Ks value.
- The fourth column is the confidence interval (95%) of the Ka/Ks
score.
- The fifth column contains an asterisk if the Ka/Ks value is greater than 1 and the
lower bound of the confidence interval is also above 1.
7. What is the best way to run Selecton, and how do I perform statistical testing for positive selection?
There are several different models and methods which may be run in
Selecton, and it is recommended to read the relevant literature
(e.g. Yang et al. (2000), Massingham and Goldman (2005)) before
choosing your preferred method.
However, trends in the recent literature are somewhat in favor of an
empirical Bayesian approach. To this end, we recommend the following
steps when using Selecton:
- Run Selecton using a model which allows positive selection (M8 or
MEC) (see question 8 on how to choose a model).
- At the end of the run, use the graphical display of the Selecton
results to see whether any sites are colored in yellow, i.e. display signs of
positive selection. If not, there is no
evidence of positive selection in this protein. If so, there is reason to suspect positive selection
in the protein you are studying, and a statistical test must be
performed to verify that the positive selection signal is significant. To do so, click on the button termed "Test
Statistical Significance".
This will automatically run the relevant null
model (M8a by default for both the M8 and MEC models), which does not allow
for positive selection. At the end of the
null model run, Selecton will compare the likelihood values under both
models, using either a likelihood
ratio test (LRT) for an M8-M8a comparison (for more details see, for example, Anisimova et
al. 2001) or a comparison of AIC scores for a MEC-M8a comparison (see, for example, Doron-Faigenboim and
Pupko 2007). This test determines whether it is justified to assume that
positive selection is operating on the gene: if the positive selection
model fits the data significantly better (as determined by LRT
or AIC), this indicates that positive selection is operating on the
protein. Selecton will report the level of significance (or
non-significance) at the end of the null model run.
- If the result of the previous step indicates that positive
selection is statistically significant, use the Ka/Ks values from the run of the model enabling
positive selection (M8 or MEC) to
determine site-specific positive selection: sites colored in yellow are
those with Ka/Ks values higher than 1.
Note that here too, only sites with Ka/Ks > 1 whose 95% confidence
interval lies entirely above 1 (i.e., the lower bound is larger than 1)
are considered
significant. These sites are colored in dark yellow by Selecton.
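To illustrate the two comparisons mentioned in the steps above, here is a minimal sketch of a likelihood ratio test and an AIC calculation. It is not Selecton's implementation. It assumes a chi-square reference distribution with one degree of freedom for the M8-M8a test (M8 has one more free parameter, ωs, than M8a); the server may use a different reference distribution. The log-likelihood values and parameter counts are hypothetical.

```python
import math

def lrt_p_value(lnl_alt, lnl_null):
    """Likelihood ratio test assuming one degree of freedom.

    Returns (test statistic, p-value). Uses the identity
    P(chi2 with 1 df > x) = erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return statistic, math.erfc(math.sqrt(statistic / 2.0))

def aic(lnl, n_params):
    """Akaike information criterion; the model with the lower score is preferred."""
    return 2.0 * n_params - 2.0 * lnl

# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_p_value(lnl_alt=-2450.3, lnl_null=-2455.1)
print(f"M8 vs M8a: 2*dlnL = {stat:.2f}, p = {p:.4f}")

# Hypothetical MEC vs M8a comparison; parameter counts are assumptions.
print("AIC MEC:", aic(-2448.0, 5), "AIC M8a:", aic(-2455.1, 4))
```

The LRT applies only to nested models such as M8 and M8a; MEC and M8a are not nested, which is why that pair is compared by AIC instead.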
8. How do I choose which model to use?
- Choosing a positive-selection enabling model (M8, MEC, M5):
- By default Selecton runs the M8 model. The main advantage of this model
is that it allows for all types of selection (purifying, neutral and
positive), and allows nesting of models which do not allow positive
selection. The disadvantage of this model is that all amino-acid
replacements are weighted (almost) equally. For instance, the
probabilities of leucine (UUG) being replaced by either tryptophan (UGG)
or phenylalanine (UUU) are equal, since both require one
transversion. However, according to the commonly used amino-acid
similarity matrices (e.g., the JTT matrix or any
other PAM matrix) the latter is five times more likely than the former.
- For more precise testing, we recommend running the MEC model (under
advanced options). Since the MEC model takes into account the different
amino acid probabilities, the inferences of Ka may be different.
Thus, if for instance you suspect that positive selection is less
affected by moderate replacements, it is preferable to use the MEC model
over other models. The disadvantage of the MEC model is that it is more
computationally intensive than the other models and hence may take more
time to run. To summarize, we recommend the use of MEC for more
experienced users who wish to find more subtle
- The M5 model
may also allow for all types of selection, including positive
selection. It is slightly less computationally intensive compared to the M8
model. However, there are two disadvantages in using the M5 model. (a)
It is possible that none of the gamma distribution categories will enable positive
selection. This is especially probable when most of the protein is
subject to strong purifying selection, with only a few sites undergoing
positive selection. Thus, when using the M5 model
we recommend using a large number of categories (preferably the
upper limit of Selecton, which is 14) to estimate the
underlying gamma distribution. (b) There is no null model for the
M5. Thus, an AIC score comparison should be performed with a null model
in order to test the statistical significance of positive selection, if detected.
- Choosing a null model:
At the end of the run, you will
see a button labeled "Perform Statistical Testing". For both the M8
and MEC models, this will
automatically run the M8a model, and perform either a likelihood
ratio test or compare AIC scores
between the two models.
As an alternative to the M8a null model, the user may run the M7 model. This
model is more appropriate in cases of intense purifying selection, since
it assumes all ω values are taken from a beta distribution (defined
on the interval [0,1]). The disadvantages of the M7 model are discussed
in Swanson et al. 2003.
9. What tree format does Selecton accept?
Selecton requires trees in Newick format, preferably with no bootstrap
values. Please make sure that the names of the sequences contain no unusual characters (such as parentheses, colons, or semicolons). Also, please verify that the names in the tree file are identical to those in the sequence file.
This is an example of Newick format:
((((SEQ1:0.19792,SEQ2:0.21875):0.18750,SEQ3:0.22917):0.10417,SEQ4:0.001):0.008,SEQ6:0.009,SEQ7:0.2886);
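To check that the names in a tree match those in the sequence file, the leaf names can be pulled out of the Newick string and compared. The sketch below is an illustration only, with hypothetical function names. It assumes unquoted names that contain none of the unusual characters mentioned above, and it is a simple pattern match rather than a full Newick parser.

```python
import re

def newick_leaf_names(newick):
    """Extract leaf names from a Newick string without quoted labels.

    A leaf name directly follows "(" or ","; labels that follow ")"
    (internal node names, bootstrap values) are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def compare_names(tree_names, sequence_names):
    """Return (names only in the tree, names only in the sequence file)."""
    tree_set, seq_set = set(tree_names), set(sequence_names)
    return sorted(tree_set - seq_set), sorted(seq_set - tree_set)

tree = ("((((SEQ1:0.19792,SEQ2:0.21875):0.18750,SEQ3:0.22917):0.10417,"
        "SEQ4:0.001):0.008,SEQ6:0.009,SEQ7:0.2886);")
names = newick_leaf_names(tree)
print(names)

# Hypothetical sequence names, as they might appear in a fasta file.
print(compare_names(names, ["SEQ1", "SEQ2", "SEQ3", "SEQ4", "SEQ5", "SEQ6"]))
```

Both returned lists should be empty before the tree and the sequence file are uploaded together.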
10. How are the Ka/Ks confidence intervals computed?
The confidence interval is calculated from the posterior
distribution of Ka/Ks values, and is defined as the 5th and 95th
percentiles of the posterior distribution. In practice, since Selecton works
with a discrete distribution, we calculate the cumulative distribution
function (CDF) over all of the posterior distribution values. The first
Ka/Ks value at which the CDF is larger than 0.05 defines the lower bound of the
confidence interval. The upper bound of the confidence
interval is calculated similarly.
For example, for a position with the following posterior
distribution:
Ka/Ks (ω) | 0.12 | 0.25 | 0.58 | 0.91 | 1.32 |
Posterior probability | 0.02 | 0.03 | 0.15 | 0.76 | 0.04 |
the confidence interval would be (0.58,0.91).
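The calculation in this example can be reproduced with a short sketch. It is an illustration, not Selecton's code. The lower bound follows the rule stated above (the first value at which the CDF exceeds 0.05). For the upper bound, which is only described as "calculated similarly", the sketch assumes the first value at which the CDF reaches at least 0.95; this reading reproduces the interval given in the example.

```python
def confidence_interval(omegas, posteriors, lower=0.05, upper=0.95):
    """Return (lower bound, upper bound) from a discrete posterior.

    omegas must be sorted in increasing order and posteriors must sum to 1.
    """
    low = high = None
    cdf = 0.0
    for omega, prob in zip(omegas, posteriors):
        # Round the running sum to suppress floating-point noise.
        cdf = round(cdf + prob, 9)
        if low is None and cdf > lower:
            low = omega
        # Assumed rule for the upper bound: first value with CDF >= 0.95.
        if high is None and cdf >= upper:
            high = omega
    return low, high

omegas = [0.12, 0.25, 0.58, 0.91, 1.32]
posteriors = [0.02, 0.03, 0.15, 0.76, 0.04]
print(confidence_interval(omegas, posteriors))  # (0.58, 0.91)
```

Here the CDF is 0.02, 0.05, 0.20, 0.96 and 1.00 at the five values, so 0.58 is the first value with a CDF above 0.05 and 0.91 is the first with a CDF of at least 0.95.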
11. What are the parameters of each model?
For a more elaborate description of each of the models implemented in
Selecton see here.
The parameters for each model are:
- M8: α and β are the shape parameters of the beta distribution. κ is the transition/transversion ratio. ωs is the additional category representing positive selection. p1 is the proportion of sites in the ωs category.
- M8a: α, β, κ, p1 (ωs is set to 1).
- M7: α, β, κ
- M5: α, β, κ
- MEC: α, β, tr (rate of transition), tv (rate of
transversion), f (proportion of sites under no selection)
12. What is the difference between ω and Ka/Ks in the MEC model?
In brief, the MEC model is a combination of two codon replacement
probability matrices: one which assumes no selection, and one which is
an expansion of an amino-acid replacement probability matrix (e.g. the
JTT matrix), which does. The JTT matrix inherently assumes
selection. Thus, the ω values describe the selection on the
studied dataset relative to the selection which is built into the JTT
empirical amino-acid substitution model. To obtain a value in
which 1 indicates neutral evolution, one has to normalize the estimated
ω values relative to a neutral model. Such a normalization was
first described in Goldman and Yang (1994), and is further described in
detail in Doron-Faigenboim and Pupko (2007). In brief, the basic idea is
to divide the ratio of non-synonymous rates to synonymous rates
(Ka1/Ks1) by the same ratio expected under a
neutral model (Ka0/Ks0).
Thus, Ka/Ks = (Ka1/Ks1) / (Ka0/Ks0).
13. What is "job status"? Why do I need one?
In February 2008 Selecton moved to a new Linux platform, on which a queuing mechanism is implemented.
The new platform offers higher performance and is expected to improve running times compared with previous versions of Selecton.
The queue is managed on a first-in-first-out basis. Depending on the queue load, each job obtains a status: Running or Queued. Once a job is queued, its position in the queue is reported to the user. All queued jobs will eventually get a Running status. When the calculation is finished, an e-mail will be sent to the user with the job's results. For the user's convenience, we added a "time tracking" mechanism, which reports the time that has passed since the job was submitted to the server.
References
- Anisimova M, Bielawski JP, Yang Z (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585-1592.
- Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model. Mol Biol Evol 24:388-397.
- Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725-736.
- Massingham T, Goldman N (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics 169:1753-1762.
- Swanson WJ, Nielsen R, Yang Q (2003) Pervasive adaptive evolution in mammalian fertilization proteins. Mol Biol Evol 20:18-20.
- Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
For any problems or questions please contact us at
evolseq@tauex.tau.ac.il. Hope you enjoy!