Metagenomics. Statistics.

Random Forests.

The random forests technique builds a large ensemble of decision trees. Each tree is grown on a random sample of the original data drawn with replacement (bootstrapping), and at each node a user-defined number of variables, selected at random from all of the variables, is considered for the split. Many trees are built this way, and the contribution of each variable to each decision is recorded.
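The two sources of randomness described above can be sketched in base R. This is only an illustration, not the internals of the randomForest package: the sample size n, the variable count p, and the seed are made-up values, and mtry is set to the square root of p, which is the package default for classification.

```r
set.seed(42)              # arbitrary seed so the sketch is repeatable
n <- 10                   # hypothetical number of samples (rows)
p <- 9                    # hypothetical number of variables (columns)
mtry <- floor(sqrt(p))    # number of variables tried at each split

# bootstrap: draw n row indices with replacement to grow one tree
inBag <- sample(n, n, replace = TRUE)

# rows that were never drawn are "out of bag" for this tree
outOfBag <- setdiff(seq_len(n), inBag)

# at one node: choose mtry candidate variables at random, without replacement
candidates <- sample(p, mtry)
```

The out-of-bag rows are the ones a tree never saw while it was being grown, so they can be used to test that tree.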

Supervised and unsupervised random forests.

You can run random forests in two different modes: unsupervised or supervised. In the unsupervised mode no class labels are supplied: the forest is trained to separate the real data from a synthetic copy of it (two classes), and the proximity between data points measured along the way can then be used to form clusters. In the supervised mode the random forest runs as a classifier and measures the importance of each variable, that is, how much each variable contributes to classifying the data correctly.
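A minimal sketch of the two modes, assuming the randomForest package is installed. It uses the iris data that ships with R as a stand-in for your own matrix; the seed, the number of trees, and the choice of three clusters are arbitrary illustration values.

```r
library(randomForest)
set.seed(1)                       # arbitrary seed
x <- iris[, 1:4]                  # the variables
y <- iris$Species                 # the class labels (a factor)

# supervised: labels are supplied, so the forest runs as a classifier
rfSup <- randomForest(x, y, importance = TRUE, proximity = TRUE, ntree = 500)

# unsupervised: no labels are supplied
rfUnsup <- randomForest(x, proximity = TRUE, ntree = 500)

# one way to turn the proximities into clusters: treat 1 - proximity
# as a distance and cut a hierarchical clustering into 3 groups
d <- as.dist(1 - rfUnsup$proximity)
clusters <- cutree(hclust(d, method = "average"), k = 3)
table(clusters, y)
```

Hierarchical clustering is only one option here; the proximities could also be passed to MDSplot() or to another clustering method.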

Variable importance plot.

The variable importance plot is a critical output of the random forest algorithm. For each variable in your matrix it tells you how important that variable is in classifying the data. The plot shows each variable on the y-axis and its importance on the x-axis. The variables are ordered from top to bottom as most to least important, so the most important variables are at the top, and the position of each dot on the x-axis gives an estimate of that variable's importance. You should use the most important variables, as determined from the variable importance plot, in the PCA, CDA, or other analyses. Typically, you should look for a large break between variables to decide how many important variables to choose. This is an important tool for reducing the number of variables for other data analysis techniques, but you should be careful not to keep either too few variables (which won't separate the data) or too many variables (which will over-explain the differences).
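Looking for a large break can also be done numerically. The sketch below is one simple heuristic, not a rule from the randomForest package: it sorts the importance scores and cuts at the largest drop between neighbours. The variable names and scores are invented for illustration.

```r
# hypothetical mean decrease in accuracy for six variables
imp <- c(varA = 41.2, varB = 38.7, varC = 35.1,
         varD = 12.4, varE = 10.9, varF = 9.8)

ranked <- sort(imp, decreasing = TRUE)    # most important first
gaps <- -diff(ranked)                     # drop from each variable to the next
cutAt <- which.max(gaps)                  # position of the largest break
topVars <- names(ranked)[seq_len(cutAt)]  # variables above the break
topVars
```

With these made-up scores the largest break falls after the third variable, so varA, varB and varC are kept. Always check the choice against the plot itself, since a single largest gap can leave you with too few variables.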

Importance table.

You can also get a table of the importance of every variable in classifying your data. With a supervised random forest, the function importance() takes a random forest as its argument, like this: imp <- importance(rfAll). It returns a table with one row per variable. If the forest was built with importance=TRUE, the table has one column per class giving the decrease in accuracy for that class, followed by the overall mean decrease in accuracy and the mean decrease in Gini.
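As an illustration of what the table looks like, here is a small sketch that assumes the randomForest package is installed and uses the iris data that ships with R in place of your own data. The object name rfIris, the seed, and the number of trees are arbitrary.

```r
library(randomForest)
set.seed(1)                       # arbitrary seed
rfIris <- randomForest(iris[, 1:4], iris$Species,
                       importance = TRUE, ntree = 500)

imp <- importance(rfIris)
dim(imp)         # one row per variable
colnames(imp)    # one column per class, then the two summary measures

# sort the table so the most important variables come first
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
```

Sorting on the MeanDecreaseGini column instead gives the ranking shown in a type 2 plot.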

Decrease in accuracy.

The mean decrease in accuracy a variable causes is determined during the out-of-bag error calculation phase. The values of a single variable are permuted in the out-of-bag samples, and the more the accuracy of the random forest decreases as a result, the more important that variable is deemed. Variables with a large mean decrease in accuracy are therefore more important for classification of the data.

The mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest. Each time a particular variable is used to split a node, the Gini coefficient of the child nodes is calculated and compared to that of the original node. The Gini coefficient used here (often called the Gini impurity) measures homogeneity: it is 0 for a completely homogeneous node and approaches 1 as a node becomes more heterogeneous. The changes in Gini are summed for each variable and normalized at the end of the calculation. Variables that result in nodes with higher purity have a higher decrease in Gini coefficient.
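The Gini calculation for a single split can be worked by hand. This is a sketch of the idea only, assuming the usual definition of Gini impurity (1 minus the sum of squared class proportions) and weighting each child node by its size; the helper function gini() and the node contents are hypothetical.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# hypothetical parent node with two classes, split into two child nodes
parent <- c("A", "A", "A", "A", "B", "B", "B", "B")
left   <- c("A", "A", "A", "B")
right  <- c("A", "B", "B", "B")

# impurity of the children, weighted by how many samples each one holds
weighted <- (length(left) * gini(left) + length(right) * gini(right)) /
  length(parent)

# the decrease credited to the variable that made this split
decrease <- gini(parent) - weighted
decrease
```

Here the parent has impurity 0.5 and each child 0.375, so the split is credited with a decrease of 0.125. A split that produced two pure children would be credited with the full 0.5.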

A type 1 variable importance plot shows the mean decrease in accuracy, while a type 2 plot shows the mean decrease in Gini.

Code to run random forests.

This code will take the normalized data and run a supervised random forest based on the values in the first column of the data.

Download.

Download the R code

Import directly into R:

source("http://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.r");

Copy.

# load the random forest library
library(randomForest)
# run a supervised random forest
rfAll <- randomForest(data[,2:ncol(data)], 
	data[,1], importance=TRUE, 
	proximity=TRUE, ntree=5000)


par(mfrow=c(2,1))
par(pty="s")

# plot the variable importance, one plot per measure
# (type=1: mean decrease in accuracy, type=2: mean decrease in Gini)
varImpPlot(rfAll, type=1, pch=19, col=1, cex=.5, main="")
varImpPlot(rfAll, type=2, pch=19, col=1, cex=.5, main="")

# write the variable importance to a file that can be read into excel
fo <- file("rf.txt", "w")
imp <- importance(rfAll);
write.table(imp, fo, sep="\t")
flush(fo)
close(fo)

References.

Classification and Regression by randomForest by Andy Liaw and Matthew Wiener in R News (ISSN 1609-3631).

The randomForest manual by Andy Liaw and Matthew Wiener at r-project.org