In doing the data analysis for this EMAP paper a hand-curated set of functional annotations was assembled for all of the 552 mutants in the collection. This set of annotations started with the GO categories, but was manually “cleaned up” based on decades of work in the mRNA processing field. I’ve provided a file containing these annotations, ordered according to the mutant axes in the genetic interaction data file:
annotations <- read.table( "data/rnaworld-annotations.txt" , header = TRUE , quote = "\"" ) levels(annotations$category)
##  "Chromatin Remodeling" "Mitochondrial " ##  "mRNA Export" "mRNA Processing" ##  "mRNA Splicing" "Nuclear Pore" ##  "Other" "RNA Degradation" ##  "rRNA Processing" "rRNA Processing " ##  "Transcription" "TRANSCRIPTION" ##  "TRANSLATION" "tRNA Metabolism" ##  "tRNA Metabolism " "Ubiquitin/Protein Degradation" ##  "UNKNOWN"
You’ll see we boiled down the gene categories into a few very high-level bins.
Principal component analysis or PCA is a common data exploration strategy that you’ll see used often in systems biology. In our correlation matrix from last week, we saw that there are, in fact, some highly correlated sets of mutants (CFT1 & SNU23, for example). A “principal component” is a vector that describes covariance (corresponding variation) in all/most/some of the primary data vectors (here our genetic interaction profiles or correlation profiles). A “good” principal component is highly co-variant with a subset of the data and is NOT correlated with other “good” principal components. Thus, PCA is one way of “simplifying” your data down from a set of 552 unique deletion behaviors to a smaller set of principal components that describe most of the variation in the original dataset.
There are several ways of implementing PCA (most popular are probably eigenvalue decomposition and single value decomposition), none of which work with missing data (sparse matrixes). We have lots of missing data values, so we’ll start by performing PCA on our correlation matrix; but it’s important to remember that in so doing we are analyzing covariation in correlation profiles, not the raw genetic interaction scores themselves.
princomp functions both perform PCA (with different procedures):
interactionMatrix <- cor(emap, use = 'pairwise.complete.obs') pcs <- prcomp(interactionMatrix)
You can see a
summary of all of the possible principal components:
We always have as many possible principal components as input vectors, but the “usefulness” (proportion of variance explained by each) diminishes at the higher component numbers. You can
plot the top 10 (this is called a scree plot):
What do we see here? How do we know which, if any, of these principal components are useful decompositions of our data? Let’s take a look at whether or not any of the top PCs effectively describe variation accross our functional groups (see
?facet_wrap to make sure you follow along below):
library(ggplot2) library(reshape2) annotatedPCs <- cbind( category = annotations$category , melt(pcs$rotation[,1:9]) ) orderdPCs <- annotatedPCs[ order( annotatedPCs$X2 , annotatedPCs$category , annotatedPCs$X1 ) , ] orderdPCs$geneID <- rep(1:552, times = 9) ggplot(orderdPCs, aes(geneID, value, fill=category)) + geom_bar(stat = 'identity') + facet_wrap(~X2)
What do we see here? There are some regions in which PC values are mapping to the functional categories (for example, PC1 mostly discriminates chromatin from the mitochondria), but the effects are moderate. Try repeating this procedure to ask if any of the PCs summarize differences between deletion and DAMP strains (mostly to prove to yourself you followed along above!).
One complicating factor here is that we are assessing covariance in the correlation profiles rather than in the raw data themselves. One (sometimes sketchy) approach to dealing with missing data upstream of PCA is to use data “imputation”. There are many imputation strategies, all of which amount to making-up data for missing values. We should not be terribly surprised when a model is used to impute data and then it is discovered that the data fit nicely to a model. However, it’s worth taking a pass at it with our raw data.
PCA function in the
FactoMineR package performs imputation by default if you have missing values:
Explore PCA with this package and the raw data values. Are the results more compelling?
PCA can be quite useful for higher order analysis as PCs can be used in any of the downstream applications where raw data vectors would have been, including clustering and network building.
There are MANY different approaches to building network topologies from raw data and correlation matrixes. Just a few Baysian approaches are described on the CRAN Task-views page.
In systems biology, the Markov clustering algorithm, or MCL, is very much en vogue because of its spead and generally good performance with biological data. Here, we’ll walk through performing MCL and drawing the resulting network, which will also give us the opportunity to see how easy it is to use external data software as part of an analysis project in R.
Currently there are a number of software pacakges that implement MCL (including one with GPU compute support), but the original was published as a stand-alone command line tool.
From the shell, you can download the source code and compile the
mcl command line program very easily:
wget http://micans.org/mcl/src/mcl-12-068.tar.gz tar xzf mcl-12-068.tar.gz cd mcl-12-068 ./configure --prefix=$HOME/local make install
By default, compiled binaries will be installed in your
~/local/bin directory, so you can make sure the compilation went well by running this shell command:
You can run external programs (or exectue any shell command) from within an R session using the
system function. For example:
This makes it very easy to chain calls to external tools into your R data analysis scripts.
From the main manual page for the MCL command line tool, we can see that this program expects an input file with three columns (unnamed):
weight. Let’s start by using our pearson correlation matrix for weights:
write.table( melt(interactionMatrix) , "interaction-matrix.txt" , col.names = FALSE , row.names = FALSE , quote = FALSE )
We can then run
mcl on this file, specifying a new file name to output results:
system("~/local/bin/mcl interaction-matrix.txt --abc -o mcl.results.txt")
Explore the structure of this results file!
Here are some things to try out (either with your own data, or with this EMAP data):
Play with the
inflation parameter using the -I switch to see how this affects the number and organization of the clusters you get back.
Explore different mechanisms for visualizing the network of clusters you get out of the MCL algorithm. See the
igraph package for useful network graph drawing tools.
Explore alternative network building algorithms listed on the CRAN task views page.