Frequently Asked Questions (FAQ)


1. What is the minimum number of sequences required to get reliable results?

2. Does the entire sequence of the protein need to be submitted or will a partial sequence suffice?

3. What are the advantages of using phylogenetic trees?

4. What should I do if I find problems uploading files?

5. Protein Explorer does not run on my computer after a Selecton run is completed. How do I fix it?

6. What do the different fields in the output file termed 'Codon Ka/Ks scores (numerical values)' mean?

7. What is the best way to run Selecton, and how do I perform statistical testing for positive selection?

8. How do I choose which model to use?

9. What tree format does Selecton accept?

10. How are the Ka/Ks confidence intervals computed?

11. What are the parameters of each model?

12. What is the difference between ω and Ka/Ks in the MEC model?

13. What is "job status"? Why do I need one?



1. What is the minimum number of sequences required to get reliable results?
There is no exact answer to this, as the distance between the sequences matters. Nevertheless, as a 'rule of thumb' we recommend a minimum of 10 homologues, although Selecton will accept as few as 3 sequences. An MSA of fewer than 10 homologues does not usually supply enough sequence information. On the other hand, if too many distant homologues are used, saturation of the number of silent mutations (Ks) may occur, thus distorting the Ka/Ks ratio computations. Again, as a 'rule of thumb', we recommend using sequences whose distances are of the same order of magnitude as the distances between mammalian sequences. For more distant sequences, caution should be taken when interpreting the results of Selecton.

2. Does the entire sequence of the protein need to be submitted or will a partial sequence suffice?
The server does not require the full sequence of the protein; partial sequences are accepted. However, the user is responsible for interpreting the reliability of such results, since selective forces acting on the entire protein will not be properly taken into account (see the "Estimating Ka/Ks" section in OVERVIEW). When a PDB file is used, the query sequence is extracted from the MSA file and aligned against the sequence extracted from the ATOM field of the PDB file. If there is a significant difference (less than 60% homology), Selecton will exit. For smaller differences, a warning will be given but the calculation will continue.

3. What are the advantages of using phylogenetic trees?
Phylogenetic trees take into account the distances between different sequences. For instance, if many sites of a human and a chimpanzee sequence are identical, this is not surprising. However, a high level of identity between a human and a cucumber sequence is far more surprising. Methods which take into account the phylogenetic tree (as opposed to methods which only take into account the multiple sequence alignment) deal better with relations between sequences, by weighting clusters of related sequences differently. This clustering process also diminishes the influence of redundant sequences.
By default, Selecton computes a neighbor-joining tree. However, we recommend providing Selecton a more precise phylogenetic tree (for example, a tree computed using maximum likelihood reconstruction methods).

4. What should I do if I find problems uploading sequence files?
  • The accepted format for DNA sequence files is fasta:
    >seq1
    ATGTAT
    >seq2
    ATGAAT
    If you are working with other sequence file formats such as Clustal, Phylip, etc., we suggest using software such as BioEdit to convert your file to fasta.
  • Check your file in a simple text editor (e.g. Notepad on Windows). It is very common that files downloaded from the web contain unnecessary characters. Eliminate them, and save your file as text only.
  • We have found an incompatibility between the text formats of PC/Unix and Mac machines. If you are running the Selecton server from a Mac and you get repeated error messages, we recommend that you save your file using Word as an "MS-DOS" text file. This format should be compatible with DOS and Unix text files.
  • Check that your file format is correct. If you are supplying a codon-aligned file (this is an example of a codon alignment, and this is an example of a non-codon alignment), please verify that there are no internal stop codons and that the length of every sequence is divisible by 3.
  • When uploading a phylogenetic tree: Please make sure that the names in the phylogenetic tree are precisely the same as the names in the DNA sequences file. Any mismatch will lead to Selecton crashing.
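The codon-alignment checks listed above can be run locally before uploading. The following is a minimal sketch, not part of Selecton: the function names are hypothetical, and the standard genetic code stop codons (TAA, TAG, TGA) are assumed. A stop codon in the final position of a sequence is treated as terminal rather than internal.

```python
# Minimal sketch: pre-upload checks for a codon-aligned fasta file.
# Assumptions (not from Selecton): standard genetic code stop codons,
# and the helper names used here.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(text):
    """Parse fasta text into a dict mapping sequence name -> sequence."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line.upper()
    return sequences


def check_codon_alignment(sequences):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for name, seq in sequences.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not divisible by 3")
        # Split into complete codons, ignoring any trailing partial codon.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        # The last codon is skipped: a stop there is terminal, not internal.
        for index, codon in enumerate(codons[:-1], start=1):
            if codon in STOP_CODONS:
                problems.append(f"{name}: internal stop codon {codon} at codon {index}")
    return problems


example = """>seq1
ATGTAT
>seq2
ATGAAT
"""
print(check_codon_alignment(read_fasta(example)))  # prints []
```

An empty list only means that these two checks passed; it does not guarantee that the server will accept the file.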

5. Protein Explorer does not run on my computer after a Selecton run is completed. How do I fix it?
Protein Explorer is compatible with Internet Explorer 5.5/6 and Netscape 4.7x, and requires the installation of Chime 2.6 SP4.
For additional information, see the "Troubleshooting Chime and Protein Explorer" page.
If that page does not solve your problem, or your browser is incompatible with Chime, the Selecton server also produces a RasMol coloring script, which can be used to view the color-coded results using RasTop.

6. What do the different fields in the output file termed 'Codon Ka/Ks scores (numerical values)' mean?
Ka/Ks is presented on the query sequence provided by the user.
  • The first column is the amino acid number.
  • The second column is the amino acid.
  • The third column is the actual Ka/Ks value.
  • The fourth column is the confidence interval (95%) of the Ka/Ks score.
  • The fifth column contains an asterisk if the Ka/Ks value is greater than 1 and the lower bound of the confidence interval is also above 1.

7. What is the best way to run Selecton, and how do I perform statistical testing for positive selection?
There are several different models and methods which may be run in Selecton, and we recommend reading the relevant literature (e.g. Yang et al. (2000), Massingham and Goldman (2005)) before choosing your preferred method.
However, the recent literature tends to favor an empirical Bayesian approach. To this end, these are the steps we recommend when using Selecton:
  1. Run Selecton using a model which allows positive selection (M8 or MEC; see question 8, "How do I choose which model to use?").

  2. At the end of the run, use the graphical display of the Selecton results to see whether any sites are colored yellow, i.e. display signs of positive selection. If not, there is no evidence of positive selection in this protein. If so, there is reason to suspect positive selection in the protein you are studying, and a statistical test must be performed to verify that the positive selection signal is significant. To do so, click the button labeled "Test Statistical Significance".
    This will automatically run the relevant null model (M8a by default for both the M8 and MEC models), which does not allow for positive selection. At the end of the null model run, Selecton will compare the likelihood values under both models, using either a likelihood ratio test (LRT) for an M8-M8a comparison (for more details see, for example, Anisimova et al. 2001) or a comparison of AIC scores for a MEC-M8a comparison (see, for example, Doron-Faigenboim and Pupko 2007). This test determines whether it is justified to assume that positive selection is operating on the gene: if the positive selection model has a significantly higher likelihood score (as determined by LRT or AIC), this signifies that positive selection is operating on the protein. Selecton will report the level of significance (or non-significance) at the end of the null model run.

  3. If the result of the previous step indicates that positive selection is statistically significant, use the Ka/Ks values from the run of the model enabling positive selection (M8 or MEC) to determine site-specific positive selection: sites colored in yellow are those with Ka/Ks values higher than 1.
    Note that here too, only sites with Ka/Ks > 1 whose entire 95% confidence interval lies above 1 (i.e., the lower bound is larger than 1) are considered significant. These sites are colored in dark yellow by Selecton.
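The two comparisons in step 2 can be sketched as follows. This is an illustration under stated assumptions, not Selecton's implementation: the log-likelihood values are made up, the likelihood ratio test is assumed to use a chi-square distribution with 1 degree of freedom, and the parameter counts passed to the AIC function are taken from the model parameter lists in question 11 (branch lengths are ignored because they are common to both models).

```python
import math

# Minimal sketch of an LRT and an AIC comparison.
# Assumptions (not from Selecton): 1 degree of freedom for the M8 vs. M8a
# test, the parameter counts below, and the made-up log-likelihood values.


def lrt_p_value(lnl_alternative, lnl_null):
    """Likelihood ratio test p-value, assuming 1 degree of freedom."""
    statistic = 2.0 * (lnl_alternative - lnl_null)
    if statistic <= 0:
        return 1.0
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(statistic / 2.0))


def aic(lnl, n_parameters):
    """Akaike information criterion; the lower score is preferred."""
    return 2.0 * n_parameters - 2.0 * lnl


# M8 vs. M8a: hypothetical log-likelihoods.
p = lrt_p_value(lnl_alternative=-1000.0, lnl_null=-1003.0)
print(f"M8 vs. M8a: p = {p:.4f}")

# MEC vs. M8a: the models are not nested, so AIC scores are compared instead.
aic_mec = aic(-998.0, n_parameters=5)   # alpha, beta, tr, tv, f
aic_m8a = aic(-1003.0, n_parameters=4)  # alpha, beta, kappa, p1
print("MEC preferred" if aic_mec < aic_m8a else "M8a preferred")
```

Selecton reports the outcome of this test itself; the sketch is only meant to show what the reported significance is based on.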


8. How do I choose which model to use?
  1. Choosing a positive-selection enabling model (M8, MEC, M5):

    • By default, Selecton runs the M8 model. The main advantage of this model is that it allows for all types of selection (purifying, neutral and positive), and allows nesting of models which do not allow positive selection. The disadvantage of this model is that all amino-acid replacements are weighted (almost) equally. For instance, the probabilities of leucine (UUG) being replaced by either tryptophan (UGG) or phenylalanine (UUU) are equal, since both require one transversion. However, according to the commonly used amino-acid similarity matrices (e.g., the JTT matrix or any other PAM matrix), the latter is five times more likely than the former.
    • For more precise testing, we recommend running the MEC model (under advanced options). Since the MEC model takes into account the different amino acid probabilities, the inferences of Ka may be different.
      Thus, if for instance you suspect that positive selection is less affected by moderate replacements, it is preferable to use the MEC model over other models. The disadvantage of the MEC model is that it is more computationally intensive than the other models and hence may take more time to run. To summarize, we recommend the use of MEC for more experienced users who wish to find more subtle
    • The M5 model may also allow for all types of selection, including positive selection. It is slightly less computationally intensive than the M8 model. However, there are two disadvantages in using the M5 model: (a) It is possible that none of the gamma distribution categories will enable positive selection. This is especially probable when most of the protein is subject to strong purifying selection, with only a few sites undergoing positive selection. Thus, when using the M5 model we recommend using a large number of categories (the upper limit in Selecton is 14) to estimate the underlying gamma distribution. (b) M5 has no null model of its own. Thus, if positive selection is detected, an AIC score comparison against one of the other null models should be performed in order to test its statistical significance.

  2. Choosing a null model:

    At the end of the run, you will see a button labeled "Perform Statistical Testing". For both the M8 and MEC models, this will automatically run the M8a model, and perform either a likelihood ratio test or compare AIC scores between the two models.
    As an alternative to the M8a null model, the user may run the M7 model. This model is more appropriate in cases of intense purifying selection, since it assumes all ω values are taken from a beta distribution (defined on the interval [0,1]). The disadvantages of the M7 model are discussed in Swanson et al. 2003.


9. What tree format does Selecton accept?
Selecton requires trees in Newick format, preferably with no bootstrap values.
Please make sure that the names of the sequences contain no unusual characters (such as brackets (), colons :, or semicolons ;). Also, please verify that the names in the tree file are identical to those in the sequence file.
This is an example of the Newick format:
((((SEQ1:0.19792,SEQ2:0.21875):0.18750,SEQ3:0.22917):0.10417,SEQ4:0.001):0.008,SEQ6:0.009,SEQ7:0.2886);
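To check that the names in a tree match the names in the sequence file, the leaf names can be extracted from the Newick string and compared. The following is a minimal sketch, assuming unquoted names that contain none of the unusual characters listed above; the function names are hypothetical and not part of Selecton.

```python
import re

# Minimal sketch: pull leaf names out of a Newick string and compare them with
# the sequence names. Assumptions (not from Selecton): names are unquoted, and
# the function names are illustrative only.

# A leaf name directly follows "(" or ","; internal labels follow ")".
LEAF_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+)")


def newick_leaf_names(newick):
    """Return leaf names in order of appearance (unquoted names only)."""
    return LEAF_PATTERN.findall(newick)


def name_mismatches(newick, sequence_names):
    """Return (names only in the tree, names only in the sequence file)."""
    tree_names = set(newick_leaf_names(newick))
    seq_names = set(sequence_names)
    return sorted(tree_names - seq_names), sorted(seq_names - tree_names)


tree = ("((((SEQ1:0.19792,SEQ2:0.21875):0.18750,SEQ3:0.22917):0.10417,"
        "SEQ4:0.001):0.008,SEQ6:0.009,SEQ7:0.2886);")
print(newick_leaf_names(tree))
# prints ['SEQ1', 'SEQ2', 'SEQ3', 'SEQ4', 'SEQ6', 'SEQ7']
print(name_mismatches(tree, ["SEQ1", "SEQ2", "SEQ3", "SEQ4", "SEQ5", "SEQ7"]))
# prints (['SEQ6'], ['SEQ5'])
```

Two empty lists mean that the tree and the sequence file use exactly the same set of names.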

10. How are the Ka/Ks confidence intervals computed?
The confidence interval is calculated from the posterior distribution of Ka/Ks values, and is defined by the 5th and 95th percentiles of the posterior distribution. In practice, since Selecton works with a discrete distribution, we calculate the cumulative distribution function (CDF) over all of the posterior distribution values. The lower bound of the confidence interval is the first Ka/Ks value at which the CDF is larger than 0.05. The upper bound is calculated similarly, using a threshold of 0.95.
For example, for a position with the following posterior distribution:

Ka/Ks (ω)               0.12   0.25   0.58   0.91   1.32
Posterior probability   0.02   0.03   0.15   0.76   0.04

the confidence interval would be (0.58,0.91).
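The worked example above can be reproduced with a few lines of Python. This is a minimal sketch rather than Selecton's own code: the function name is hypothetical, and the strict "larger than" comparison is an assumption chosen to match the description and the example. Decimal arithmetic is used so that a cumulative sum such as 0.02 + 0.03 compares exactly against 0.05.

```python
from decimal import Decimal

# Minimal sketch of the confidence interval procedure described above.
# Assumptions (not from the Selecton source): the function name and the
# strict ">" comparison against the 0.05 and 0.95 thresholds.


def posterior_interval(values, probabilities, lower="0.05", upper="0.95"):
    """Return (lower bound, upper bound) from a discrete posterior."""
    lower, upper = Decimal(lower), Decimal(upper)
    cumulative = Decimal(0)
    low_bound = high_bound = None
    # Walk through the Ka/Ks categories in increasing order, building the CDF.
    for value, probability in sorted(zip(values, probabilities)):
        cumulative += probability
        if low_bound is None and cumulative > lower:
            low_bound = value
        if high_bound is None and cumulative > upper:
            high_bound = value
    return low_bound, high_bound


omega = [Decimal(x) for x in ("0.12", "0.25", "0.58", "0.91", "1.32")]
posterior = [Decimal(x) for x in ("0.02", "0.03", "0.15", "0.76", "0.04")]

low, high = posterior_interval(omega, posterior)
print(low, high)  # prints 0.58 0.91
# A site is marked as significant only if the lower bound is above 1.
print("significant" if low > 1 else "not significant")  # prints not significant
```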

11. What are the parameters of each model?
For a more elaborate description of each of the models implemented in Selecton see here.
The parameters for each model are:
  • M8: α and β are the shape parameters of the beta distribution. κ is the transition/transversion ratio. ωs is the additional category representing positive selection. p1 is the proportion of ωs.
  • M8a: α, β, κ, p1. (ωs set to 1).
  • M7: α, β, κ
  • M5: α, β, κ
  • MEC: α, β, tr (rate of transition), tv (rate of transversion), f (proportion of sites under no selection)


12. What is the difference between ω and Ka/Ks in the MEC model?
In brief, the MEC model is a combination of two codon replacement probability matrices: one which assumes no selection, and one which is an expansion of an amino-acid replacement probability matrix (e.g. the JTT matrix), which does. The JTT matrix inherently assumes selection. Thus, the ω values describe the selection in the studied dataset relative to the selection which is built into the JTT empirical amino-acid substitution model. To obtain a value in which 1 indicates neutral evolution, one has to normalize the estimated ω values relative to a neutral model. Such a normalization was first described in Goldman and Yang (1994), and is described in further detail in Doron-Faigenboim and Pupko (2007). The basic idea is to divide the ratio of non-synonymous to synonymous rates (Ka1/Ks1) by the same ratio expected under a neutral model (Ka0/Ks0).
Thus, Ka/Ks = (Ka1/Ks1) / (Ka0/Ks0).
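As a small numerical illustration of this normalization (the rate values are made up and the function name is hypothetical):

```python
# Minimal sketch of the normalization above. The rate values are made up for
# illustration and the function name is hypothetical.


def normalized_ka_ks(ka1, ks1, ka0, ks0):
    """Divide the estimated Ka/Ks ratio by the ratio expected under neutrality."""
    return (ka1 / ks1) / (ka0 / ks0)


# Estimated ratio 0.75 / 0.5 = 1.5; neutral expectation 0.375 / 0.5 = 0.75.
print(normalized_ka_ks(0.75, 0.5, 0.375, 0.5))  # prints 2.0
```

In this made-up case the normalized Ka/Ks is above 1 because the estimated ratio is twice the ratio expected under the neutral model.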

13. What is "job status"? Why do I need one?
In February 2008, Selecton moved to a new Linux platform, where a queuing mechanism is implemented. The new platform offers higher performance and is expected to improve running times compared with previous versions of Selecton.
The queue is managed on a first-in-first-out basis. Depending on the queue load, each job obtains a status: Running or Queued. Once a job is queued, its position in the queue is reported to the user. All queued jobs will eventually get a Running status. When the calculation is finished, an e-mail will be sent to the user with the job's results. For the user's convenience, we added a "time tracking" mechanism, which reports the time that has passed since the job was submitted to the server.



References
  • Anisimova M, Bielawski JP, Yang Z (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585-1592.

  • Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model. Mol Biol Evol 24:388-397.

  • Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725-736.

  • Massingham T, Goldman N (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics 169:1753-1762.

  • Swanson WJ, Nielsen R, Yang Q (2003) Pervasive adaptive evolution in mammalian fertilization proteins. Mol Biol Evol 20:18-20.

  • Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.




For any problems or questions please contact us at bioSequence@tauex.tau.ac.il. Hope you enjoy!