‘Tidy’ Pop-Gen PCA plotting in R

James A. Fellows Yates

I was recently using the R package pvclust to test the ‘robusticity’ of clusters in a microbiome-related clustering analysis. While pvclust provides its own plots via plot() on a pvclust object, this draws the dendrogram in base R. For readability and customisability reasons I prefer using the packages ggplot2 and ggtree for making my figures. However, I was having a hard time extracting the node uncertainty values from the pvclust object and integrating them into a generic R phylo object for plotting the dendrogram in ggtree.

Fortunately, a bit of googling (a couple of days…) showed me that someone else had already solved the problem of transferring additional hclust information into a phylo object, but in a different context. The fastbaps package has the function as.phylo.hclust.node.attributes(), which essentially does what I needed: when calculating the node number for each merge event, it also stores the corresponding attribute under the same node number, in this case the pvclust AU value.
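To make the ‘merge event’ idea concrete, here is a minimal sketch of how a base R hclust object records its merges. The toy data and the object names (toy, hc) are made up for illustration and use only stats::dist() and stats::hclust(); they are not part of the original analysis.

```r
# Toy data: four observations on a single variable (values are invented)
toy <- matrix(c(1, 2, 6, 10), ncol = 1,
              dimnames = list(c("a", "b", "c", "d"), NULL))

hc <- stats::hclust(stats::dist(toy), method = "average")

# One row per merge event. Negative entries are original observations
# (tips); positive entries refer to the cluster formed in an earlier row.
hc$merge

# One height per merge event, in the same row order as hc$merge
hc$height
```

Each row of hc$merge becomes one internal node of the tree, which is why a vector holding one value per merge (such as the AU values) can be carried along while the node numbers are assigned.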

I then modified this function slightly to make it more consistent with how pvclust displays the values in the base R plot (rounding and converting to a ‘percentage’). Note that the modified function stores the values in the phylo object as node.label, not as node.attributes as in the original fastbaps function.
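As a small illustration of that formatting step, the sketch below applies the same rounding and conversion to a hypothetical vector of AU values (the numbers are invented; in a real run they would come from the pvclust object).

```r
# Hypothetical AU p-values on a 0-1 scale (invented for illustration)
au <- c(0.9512, 0.8049, 1.0000)

# Scale to a percentage, round to a whole number, and convert to
# character so the values can be used directly as node labels
au_labels <- as.character(round(au * 100, 0))
au_labels
```

Storing the labels as character strings means ggtree can print them as-is, without further formatting.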

So in summary:

## 1. Make pvclust object e.g.
hclust_boot <- pvclust::pvclust(otu_matrix,
                 method.hclust = selected_method,
                 method.dist = "euclidean",
                 nboot = 1000,
                 parallel = TRUE)

## 2. Set modified fastbaps function
as.phylo.pvclust.node.attributes <- function(x, attribute)
{
  N <- dim(x$merge)[1]
  edge <- matrix(0L, 2*N, 2)
  edge.length <- numeric(2*N)
  ## `node' gives the number of the node for the i-th row of x$merge
  node <- integer(N)
  node[N] <- N + 2L
  node.attributes <- rep(NA, N)
  cur.nod <- N + 3L
  j <- 1L
  for (i in N:1) {
    edge[j:(j + 1), 1] <- node[i]
    for (l in 1:2) {
      k <- j + l - 1L
      y <- x$merge[i, l]
      if (y > 0) {
        edge[k, 2] <- node[y] <- cur.nod
        cur.nod <- cur.nod + 1L
        edge.length[k] <- x$height[i] - x$height[y]
        node.attributes[edge[k, 1] - (N + 1)] <- attribute[i]
      } else {
        edge[k, 2] <- -y
        edge.length[k] <- x$height[i]
        node.attributes[edge[k, 1] -  (N + 1)] <- attribute[i]
      }
    }
    j <- j + 2L
  }

  if (is.null(x$labels))
    x$labels <- as.character(1:(N + 1))

  ## MODIFICATION: clean up node.attributes so they are in the same format as
  ## in pvclust plots (rounded percentages), and blank the root node's label
  node.attributes <- as.character(round(node.attributes * 100, 0))
  node.attributes[1] <- NA

  obj <- list(edge = edge, edge.length = edge.length / 2,
              tip.label = x$labels, Nnode = N, node.label = node.attributes)
  class(obj) <- "phylo"
  stats::reorder(obj)
}

## 3. Use the modified fastbaps function by accessing the hclust object in first
## position, and the corresponding au values from the edges list entry.
hclust_boot_phylo <- as.phylo.pvclust.node.attributes(hclust_boot$hclust,
                                                     hclust_boot$edges$au)

To display the values on the tree with ggtree you can then run

ggtree(hclust_boot_phylo, aes(x, y)) +
    geom_text2(aes(subset = !isTip, label = label))

as described in the ggtree FAQ under the section “bootstrap values from newick format”.