Thursday, 18 October 2012

Plotting ages of phylogenetic trees in R

It appears that different genes might estimate very different age estimates for your phylogenetic trees. This seems to be the case with the COI gene that tends to pull your timings towards the past. Saturation in the 3rd codon position might be accused for responsibility  (you can inspect the saturation level with some cool plots).

Brandley et al (2011) use a nice way to figure out whether some of your genes might be giving very different time estimates for your trees. You can select 1000 random trees from your BEAST run and plot the distributions of the ages for the crown group of different genes, different codon positions and the combined analyses.

Something like this plot consisting on a simulation of a gen1 estimating a crown age of 30Mya, gen2 estimating an age of 50Mya and the combined analysis giving an age of 40Mya.

This can be done in the ubiquitous statistical software R, and here is the code:


# simulate three sets of trees with different ages
# you will skip this step and use your own tree files instead
trees_gen1 <-, numbsim=100, lambda=0.03, mu=0.001, age=30);
for(i in 1:length(trees_gen1)) {
  write.tree(trees_gen1[[i]], file="trees_gen1.nwk", append=TRUE)

trees_gen2 <-, numbsim=100, lambda=0.03, mu=0.001, age=50);
for(i in 1:length(trees_gen2)) {
  write.tree(trees_gen2[[i]], file="trees_gen2.nwk", append=TRUE)

trees_combined <-, numbsim=100, lambda=0.03, mu=0.001, age=40);
for(i in 1:length(trees_combined)) {
  write.tree(trees_combined[[i]], file="trees_combined.nwk", append=TRUE)

# you might want to plot the ages for an ingroup or a particular clade
# then you might want to remove certain taxa or the outgroups
# use this variable and replace with your real tip names
outgroup_tips <- c(); # add your tips as a vector, for example: c("Aus", "Bus", "Cus");

## Function to get distribution of ages for root from 1000 tree file
get_crown_age_distribution <- function(file, outgroup_tips) {
  tips <- outgroup_tips;
  phys <- read.tree(file)

  # get the branching times for the crown
  branching_times <- c();
  for( i in 1:length(phys)  ) {
    phy <- phys[[i]]
    if( length(tips) > 0) {
      phy <- drop.tip(phys[[i]], tips);
    # the the node number for the root
    nodes <- length(phy$tip.label) +  1
    x <- branching.times(phy)
    # get the branching time for the root
    branching_times <- c(branching_times, x[names(x) == nodes])

## get data
gen1 <- get_crown_age_distribution(file="trees_gen1.nwk", outgroup_tips=outgroup_tips)
gen2 <- get_crown_age_distribution(file="trees_gen2.nwk", outgroup_tips=outgroup_tips)
combined <- get_crown_age_distribution(file="trees_combined.nwk", outgroup_tips=outgroup_tips)

hist(gen1, freq=FALSE, axes=F, xlim=c(80,0), 
     border="white", main="Age posterior probability distributions\nof simulated trees",
     ylab="", xlab="Million years ago")
axis(4, at=seq(0.0,0.6, 0.1), las=1, font=2)
axis(1, font=2)

lines(density(combined), lwd=2, col="black")
lines(density(gen1), lwd=2, col="red")
lines(density(gen2), lwd=2, col="blue")


legend.txt <- c("Data combined", "gen1", "gen2")
legend.colors <- c("black",      "red", "blue")
legend(cex=0.8,"topleft", legend.txt, pch=22, lwd=0,, title="Locus", pt.cex=2)

Sunday, 14 October 2012

Plots de saturación para secuencias moleculares en R

Estuve buscando un programa que realice gráficos mostrando los niveles de transiciones y tranversiones en secuencias de ADN. Se supone que el tercer codon de una secuencia de ADN tiende a llegar a "saturación" debido a que el código genético es degenerado (varios tripletes codifican el mismo aminoácido). Esta saturación ocurre cuando las mutaciones en el tercer codón han sido tan frecuentes que ya no llevan información filogenética. Es decir, se llega al grado en que dos secuencias saturadas son parecidas simplemente por chance. El programa DAMBE realiza este tipo de gráficos al plotear la relación de transiciones y transversiones versus distancia genética, pero no pude hacer funcionar la versión para Linux. Encontré una función escrita para el programa estadístico R que sí me funcionó. La versión original la pueden encontrar aquí:

Download sample data here:


# modified from

par(mfcol=c(1,1)); par(mar=c())
par(xpd=F, mar=c(5,4,4,2)+0.2, family="Palatino")

#Input: dat---an object of class 'DNAbin'

  res<-matrix(NA, ncol=dim(mat)[1], nrow=dim(mat)[1], dimnames=list(x=names(dat), y=names(dat)))
  for(i in 1:(dim(mat)[1] - 1)){
    for(j in (i+1):dim(mat)[1]){

create_plots<-function(files) {
  for( i in 1:length(files)) {
    codon <- as.DNAbin([i]))
    if (i == 1) {
      R1 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist1 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    else if( i == 2) {
      R2 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist2 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    else if( i == 3) {
      R3 <- ti[lower.tri(ti)]/tv[lower.tri(tv)]
      dist3 <- dist.dna(codon, model="JC69", gamma=T, pairwise.deletion=T)
    else {
  # print plot
  xname = expression("JC+" * Gamma * "corrected distances")
  yname = "Transition/Tranverstion ratio"
  plot(NA, xlim=c(0,.6), ylim=c(0,40), xlab=xname, ylab=yname, main="Saturation plot for Wingless gene")
  points(R3~dist3, xlim=c(0,.6), ylim=c(0,40), col="red", pch=20, xlab="", ylab="")
  points(R1~dist1, xlim=c(0,.6), ylim=c(0,40), col="grey", pch=20, xlab="", ylab="")
  points(R2~dist2, xlim=c(0,.6), ylim=c(0,40), col="cyan", pch=20, xlab="", ylab="")

  legend.txt <- c("1st position", "2nd position", "3rd position")
  legend.colors <- c("grey",  "cyan", "red")
  legend("topright", legend.txt, pch=19, col=legend.colors, title="wingless", cex=0.9)

files <- c("wg_1.nex", "wg_2.nex", "wg_3.nex");

Y este es el resultado final para el gen wingless:
Es importante el orden al plotear los puntos (primero los más abundantes).
Así es posible ver todos.