The tree-reconstruction programs include the Q* method, described in
the paper "Inferring Evolutionary Trees with Strong Combinatorial
Evidence" (TCS, to appear) [Berry and Gascuel 97], and the Buneman
construction, described in [Buneman 71] and revisited in the paper
"Faster reliable phylogenetic analysis" (RECOMB 99) [Berry and Bryant 99].
The PhyloQuart package contains different kinds of programs:
----------------
5
Human
Chimp 0.4664
Gorilla 0.7236 0.6774
Orang 1.3870 1.4985 1.3158
Gibbon 1.4696 1.5257 1.4446 0.4860
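The example above appears to be a lower-triangular distance matrix: the
first line gives the number of taxa, and each following line gives a
taxon name followed by its distances to the taxa listed before it. Here
is a minimal Python sketch of a reader for this layout. It assumes
whitespace-separated fields and taxon names without embedded blanks; the
function name read_lower_triangular is hypothetical and is not part of
PhyloQuart.

```python
def read_lower_triangular(text):
    """Parse a lower-triangular distance matrix into (names, dist).

    Assumption: fields are whitespace-separated and names contain no blanks.
    dist is a full symmetric matrix with zeros on the diagonal.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names = []
    dist = [[0.0] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split()
        names.append(fields[0])
        values = [float(v) for v in fields[1:]]
        # Row i must hold exactly i distances (to taxa 0 .. i-1).
        if len(values) != i:
            raise ValueError(
                f"taxon {fields[0]}: expected {i} distances, got {len(values)}"
            )
        for j, d in enumerate(values):
            dist[i][j] = dist[j][i] = d
    return names, dist


example = """5
Human
Chimp 0.4664
Gorilla 0.7236 0.6774
Orang 1.3870 1.4985 1.3158
Gibbon 1.4696 1.5257 1.4446 0.4860
"""
names, dist = read_lower_triangular(example)
print(names[4], names[3], dist[4][3])
```

The matrix is mirrored on reading so that later code can look up
dist[i][j] without caring about the order of i and j.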
- CHARACTER FILES (e.g., "infile.nuc"):
they contain the nucleotide sequences of the taxa, in the same format
as the PHYLIP character files. The first line gives the number of
taxa, then the number of characters. The following lines give the
sequences. Each line begins with the name of the taxon (on exactly 10
characters, padded with blanks at the end if need be); the nucleotide
sequence of the taxon then begins at the first nucleotide encountered
from the 11th position of the line onward, and ends when as many
characters have been read as indicated in the first line of the file.
Sequences may include blank characters that split them into short
groups (e.g., every 10 characters), as in the output produced by the
readseq program.
Note that, unlike in PHYLIP, sequences cannot be interleaved: each
sequence must be given in one piece. Here is an example of a character
file:
5 100
Human     ATTTCCGGCATATATGCATATGTCTGTAACACATAGTGAGAGACGTCTATTACCTGGTCCCTACATTCCAGGTTCCCAGAAAAATCGGGAAACTGTCCCT
Chimp     CCTTCTGGCATATGTGCATGACTCTGCTACACACAGTGAAAAAGGACTATTGTGTAATCCCTACAACCCTTGTTCTGAGAAAAATCGGGGAACTGTCCCC
Gorilla   ATTTCCGGCCTATGTGCATACCTCTGTAACACATAGTGAGAGACGACCAATACGTAGTCGCTACATTCCTTGTCCTCAGAAAAATCGTGAAACTGTTCCT
Orang     ATTTCCGGCCTATGTGCATACCCCTGTAACACATAGTGAGAGACGACCAATACGTAGTCGCCACATTCCTTGTCCTCAGAAAAATCGTGAAACTGTTCCT
Gibbon    ACTTTCAGTATATGCGCATATCTTTGTAACACATAGCGAGAGACAACTGTTATGTGGTCCCTGCGTTCCCTGTTGTCTGATAAATCCGGAATCTGTCCCC
Note that gaps are currently not allowed and that the programs expect
nucleotide sequences (T and U are treated in the same way).
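The reading rules for character files can be sketched in Python as
follows. This is only an illustration of the format as described here,
not the code used by the PhyloQuart programs. The function name
read_character_file is hypothetical, and rewriting U as T and rejecting
"-" are assumptions about how the two notes above might be enforced.
The input is a shortened toy file, not real data.

```python
def read_character_file(text):
    """Parse a non-interleaved character file into {name: sequence}.

    Follows the description in this document: a 10-character name field,
    then the sequence from the 11th column onward, with blanks ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in lines[0].split()[:2])
    sequences = {}
    for line in lines[1:n_taxa + 1]:
        name = line[:10].rstrip()
        # Remove the blanks that split the sequence into short groups.
        seq = "".join(line[10:].split())
        # Assumption: normalise U to T, since both are treated alike.
        seq = seq.replace("U", "T")
        # Assumption: "-" is the gap symbol, which is not allowed.
        if "-" in seq:
            raise ValueError(f"{name}: gaps are not allowed")
        if len(seq) < n_chars:
            raise ValueError(
                f"{name}: expected {n_chars} characters, got {len(seq)}"
            )
        sequences[name] = seq[:n_chars]
    return sequences


# Shortened toy input: 3 taxa, 12 characters, split into groups of 10.
example = (
    "3 12\n"
    "Human     ATTTCCGGCA TA\n"
    "Chimp     CCTTCTGGCA TA\n"
    "Gorilla   ATTTCCGGCC UA\n"
)
seqs = read_character_file(example)
print(seqs["Gorilla"])
```

Slicing at a fixed column, rather than splitting on the first blank,
mirrors the fixed-width name field of the format.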
Morphological characters can be incorporated by coding the presence of
a character for a taxon as, e.g., the nucleotide A, and its absence as,
e.g., the nucleotide C (see fossile-horses.nuc for an example; this file
corresponds to the morphological input example of the PHYLIP package).
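As an illustration of this coding, here is a short Python sketch that
turns presence/absence observations into one line of a character file.
The helper names and the taxon name "Taxon1" are hypothetical; the
choice of A for presence and C for absence follows the example given
above, and any other pair of nucleotides would presumably do.

```python
def encode_presence_absence(traits, present="A", absent="C"):
    """Code each morphological character as one nucleotide."""
    return "".join(present if t else absent for t in traits)


def character_line(name, traits):
    """Build one character-file line: 10-character name, then the coding."""
    if len(name) > 10:
        raise ValueError("taxon names are limited to 10 characters")
    return name.ljust(10) + encode_presence_absence(traits)


# Hypothetical taxon with three characters: present, absent, present.
line = character_line("Taxon1", [True, False, True])
print(repr(line))
```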
- QUARTET FILES ("quartfile", "quartfile.res", "quartfile.left"):
they contain the number of taxa, an assignment of a number to every
taxon (one per line), then a list of quartets on the taxon numbers
(one per line). Here is an example of a quartet file:
5
01 Human
02 Chimp
03 Gorilla
04 Orang
05 Gibbon
01 02 || 03 04
01 02 || 03 05
01 02 || 04 05
01 03 || 04 05
02 03 || 04 05
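A quartet file of this shape can be read with a few lines of Python.
The sketch below is an illustration only: read_quartet_file is a
hypothetical name, and it handles just the unweighted "a b || c d" lines
shown in the example above.

```python
def read_quartet_file(text):
    """Parse a quartet file into (names, quartets).

    names maps taxon number -> taxon name; quartets is a list of
    ((a, b), (c, d)) tuples, one per "a b || c d" line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_taxa = int(lines[0])
    names = {}
    for line in lines[1:n_taxa + 1]:
        number, name = line.split(None, 1)
        names[int(number)] = name
    quartets = []
    for line in lines[n_taxa + 1:]:
        left, right = line.split("||")
        a, b = (int(x) for x in left.split())
        c, d = (int(x) for x in right.split())
        quartets.append(((a, b), (c, d)))
    return names, quartets


example = """5
01 Human
02 Chimp
03 Gorilla
04 Orang
05 Gibbon
01 02 || 03 04
01 02 || 03 05
01 02 || 04 05
01 03 || 04 05
02 03 || 04 05
"""
names, quartets = read_quartet_file(example)
for (a, b), (c, d) in quartets:
    print(f"{names[a]} {names[b]} || {names[c]} {names[d]}")
```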
- BIPARTITION FILES (e.g., "bipfile"):
such a file contains the number of taxa, the correspondence between
taxon numbers and names, and the list of edges of a tree, each
described by the bipartition it induces on the taxon set X (removing
any edge of a tree splits the tree into two components, and thus
splits X into two subsets according to the component each taxon
belongs to).
The first lines of the file contain the assignment of a number to
every taxon (one per line), followed by the list of bipartitions on
the taxon numbers (one per line). Each bipartition is followed by a
bracketed weight: e.g., 32000 (i.e., a big constant) in the case of the
QSTAR program, or, in the case of the ADDQUART program, the ratio of
the number of quartets satisfied by the edge over the number of
contradicted quartets. Here is an example of a bipartition file:
5
01 Human
02 Chimp
03 Gorilla
04 Orang
05 Gibbon
03 05 04 02 | 01 (32000)
02 | 03 05 04 01 (32000)
04 05 03 | 02 01 (32000)
03 | 04 05 01 02 (32000)
05 04 | 03 01 02 (32000)
04 | 05 02 01 03 (32000)
05 | 04 02 01 03 (32000)
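To make the notions of bipartition and of quartets satisfied or
contradicted by an edge more concrete, here is a minimal Python sketch.
It is not the ADDQUART implementation. The function names are
hypothetical, and the classification rule is an assumption: a quartet
ab||cd is counted as satisfied when the bipartition puts a, b on one
side and c, d on the other, as contradicted when it separates the four
taxa two against two in another way, and as neutral otherwise. The
bracketed weight is treated as optional.

```python
def parse_bipartition(line):
    """Parse e.g. '04 05 03 | 02 01 (32000)' into (side_a, side_b, weight)."""
    weight = None
    if "(" in line:
        line, _, rest = line.partition("(")
        weight = float(rest.rstrip().rstrip(")"))
    left, right = line.split("|")
    side_a = frozenset(int(x) for x in left.split())
    side_b = frozenset(int(x) for x in right.split())
    return side_a, side_b, weight


def quartet_status(quartet, side_a):
    """Classify quartet ((a, b), (c, d)) against a bipartition.

    Assumed rule (not taken from the PhyloQuart sources): see lead-in.
    """
    (a, b), (c, d) = quartet
    in_a = [x in side_a for x in (a, b, c, d)]
    if sum(in_a) != 2:
        return "neutral"       # the four taxa are not separated 2-2
    if in_a[0] == in_a[1]:
        return "satisfied"     # a, b on one side and c, d on the other
    return "contradicted"      # a and b fall on opposite sides


def count_support(quartets, side_a):
    """Return (number satisfied, number contradicted)."""
    statuses = [quartet_status(q, side_a) for q in quartets]
    return statuses.count("satisfied"), statuses.count("contradicted")


quartets = [
    ((1, 2), (3, 4)), ((1, 2), (3, 5)), ((1, 2), (4, 5)),
    ((1, 3), (4, 5)), ((2, 3), (4, 5)),
]
side_a, side_b, weight = parse_bipartition("04 05 03 | 02 01 (32000)")
print(sorted(side_a), sorted(side_b), weight)
print(count_support(quartets, side_a))
```

The sketch returns the two counts rather than their ratio, which leaves
the case of zero contradicted quartets to the caller.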
---------
See the documentation files specific to each program (they have the
same name as the program, with the extension ".doc", e.g., "qstar.doc").
You may read the following papers for more information on the Q*
method and other quartet-based phylogeny reconstruction methods:
- Berry V. and Bryant D., 1999, Faster reliable phylogenetic
analysis, 3rd Ann. Int. Conf. on Computational Biology (RECOMB'99).
- Berry V. and Gascuel O., 1998, Inferring evolutionary trees with strong
combinatorial evidence, Theoretical Computer Science (to appear).
- Berry V., 1997, Méthodes et Algorithmes pour Reconstruire les Arbres
de l'Evolution, Thèse de doctorat, Université de Montpellier, France.
- Bandelt H.J. and Dress A., 1986, Reconstructing the shape of a tree
from observed dissimilarity data, Adv. in Appl. Math, 7:309-343.
- Buneman P., 1971, The recovery of trees from measures of
dissimilarity, in Mathematics in the Archaeological and Historical
Sciences, 387-395, Edinburgh University Press.
---------
v1.4 : improved quartet storage (after discussion with D. Swofford and
K. Strimmer), allowing the handling of data sets with more taxa.
Change in the bipfile format (weights removed). Running time of QSTAR
improved. Addquart now accepts weighted quartets and more parameters.
Choice of edges improved.
v1.3 : the accepted format of character files (infile.nuc) has been
extended. Sequences can now include blanks that split a sequence
every so often, as in files output by the readseq program.
v1.3 : the parameter giving the number of taxa to the various
programs is no longer necessary. This information is now included in
the files exchanged by the programs.
v1.2 : taxon names can be specified instead of the 2-digit numbers
used previously.
----------------------
Vincent Berry (vberry AT lirmm.fr)
(Comments on the package or on this page are welcome).