Step 1: Load factor annotations

For the data analysis in this EMAP paper, a hand-curated set of functional annotations was assembled for all 552 mutants in the collection. This set of annotations started from the GO categories, but was manually "cleaned up" based on decades of work in the mRNA processing field. I've provided a file containing these annotations, ordered to match the mutant axes in the genetic interaction data file:

annotations <- read.table( "data/rnaworld-annotations.txt"
                         , header           = TRUE
                         , quote            = "\""
                         , stringsAsFactors = TRUE  # category must be a factor (the default changed in R 4.0)
                         )

levels(annotations$category)
##  [1] "Chromatin Remodeling"          "Mitochondrial "               
##  [3] "mRNA Export"                   "mRNA Processing"              
##  [5] "mRNA Splicing"                 "Nuclear Pore"                 
##  [7] "Other"                         "RNA Degradation"              
##  [9] "rRNA Processing"               "rRNA Processing "             
## [11] "Transcription"                 "TRANSCRIPTION"                
## [13] "TRANSLATION"                   "tRNA Metabolism"              
## [15] "tRNA Metabolism "              "Ubiquitin/Protein Degradation"
## [17] "UNKNOWN"

You'll see we boiled the gene categories down into a few very high-level bins. Note that the levels aren't perfectly tidy: trailing whitespace and inconsistent capitalization produce near-duplicates like "rRNA Processing" vs. "rRNA Processing " and "Transcription" vs. "TRANSCRIPTION".
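If you'd like to collapse those near-duplicates before going further, here's a minimal cleanup sketch (this just trims the stray whitespace; folding the capitalization differences would work the same way):

# drop trailing spaces so "rRNA Processing " collapses into "rRNA Processing"
annotations$category <- factor(trimws(as.character(annotations$category)))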

Step 2: Principal Component Analysis

Principal component analysis, or PCA, is a common data exploration strategy that you'll see used often in systems biology. In our correlation matrix from last week, we saw that there are, in fact, some highly correlated sets of mutants (CFT1 & SNU23, for example). A "principal component" is a vector that captures covariance (coordinated variation) across many of the primary data vectors (here, our genetic interaction profiles or correlation profiles). A "good" principal component is highly covariant with a subset of the data, and by construction the principal components are uncorrelated with one another. Thus, PCA is one way of "simplifying" your data from a set of 552 unique deletion behaviors down to a smaller set of principal components that describe most of the variation in the original dataset.

There are several ways of implementing PCA (the most popular are probably eigenvalue decomposition and singular value decomposition), none of which work with missing data. We have lots of missing values, so we'll start by performing PCA on our correlation matrix; but it's important to remember that in doing so we are analyzing covariation in correlation profiles, not in the raw genetic interaction scores themselves.

The built-in prcomp and princomp functions both perform PCA (prcomp via singular value decomposition of the data, princomp via eigendecomposition of the covariance matrix):

# correlate mutant profiles, using all pairwise-complete observations
interactionMatrix <- cor(emap, use = 'pairwise.complete.obs')

# PCA on the (complete) correlation matrix
pcs <- prcomp(interactionMatrix)

You can see a summary of all of the possible principal components:

summary(pcs)

We always have as many possible principal components as input vectors, but their "usefulness" (the proportion of variance each explains) diminishes at higher component numbers. You can plot the variances of the top 10 components (this is called a scree plot):

plot(pcs)

[Figure: scree plot showing the variances of the top 10 principal components]
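The bars in that plot are just the component variances; you can get the proportion of variance explained by each component directly from the standard deviations that prcomp returns:

# proportion of total variance explained by each component
propVar <- pcs$sdev^2 / sum(pcs$sdev^2)
head(propVar, 10)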

What do we see here? How do we know which, if any, of these principal components are useful decompositions of our data? Let's take a look at whether any of the top PCs effectively describe variation across our functional groups (see ?cbind, ?prcomp and ?facet_wrap to make sure you can follow along below):

library(ggplot2)
library(reshape2)

# melt the loadings for the first 9 PCs into long format;
# name the melted columns explicitly (X1 = gene, X2 = component)
annotatedPCs <- cbind( category = annotations$category
                     , melt( pcs$rotation[, 1:9]
                           , varnames = c("X1", "X2")
                           )
                     )

# order by component, then category, then gene name
orderedPCs        <- annotatedPCs[ order( annotatedPCs$X2
                                        , annotatedPCs$category
                                        , annotatedPCs$X1
                                        )
                                 ,
                                 ]

# index genes 1..552 within each of the 9 components for plotting
orderedPCs$geneID <- rep(1:552, times = 9)

ggplot(orderedPCs, aes(geneID, value, fill = category)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~X2)

[Figure: faceted bar plots of the loadings on PC1-PC9, one bar per gene, colored by functional category]

What do we see here? There are some regions in which the PC values map onto the functional categories (for example, PC1 mostly discriminates the chromatin factors from the mitochondrial ones), but the effects are moderate. Try repeating this procedure to ask whether any of the PCs summarize differences between deletion and DAmP strains (mostly to prove to yourself that you followed along above!).
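As a starting point for that exercise, here's a minimal sketch. It assumes the annotation table has a column flagging how each strain was constructed; mutantType is a placeholder name, so check names(annotations) for the real column in your file:

# same plot as above, but colored by strain type instead of functional category
# NOTE: "mutantType" is a hypothetical column name -- substitute your own
typedPCs        <- cbind( mutantType = annotations$mutantType
                        , melt( pcs$rotation[, 1:9]
                              , varnames = c("X1", "X2")
                              )
                        )
typedPCs$geneID <- rep(1:552, times = 9)

ggplot(typedPCs, aes(geneID, value, fill = mutantType)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~X2)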

One complicating factor here is that we are assessing covariance in the correlation profiles rather than in the raw data themselves. One (sometimes sketchy) approach to dealing with missing data upstream of PCA is "imputation". There are many imputation strategies, all of which amount to making up data for the missing values. We should not be terribly surprised when a model is used to impute data and the data then turn out to fit that model nicely. Still, it's worth taking a pass at it with our raw data.

The PCA function in the FactoMineR package performs imputation by default if you have missing values:

library(FactoMineR)
?PCA

Explore PCA with this package and the raw data values. Are the results more compelling?
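To get you started, here's a minimal sketch, assuming emap is the raw score matrix from last week (FactoMineR's PCA mean-imputes missing values by default; ncp sets how many components to keep, and graph = FALSE suppresses the default plots):

library(FactoMineR)

# PCA on the raw scores; missing values are replaced by column means by default
rawPCs <- PCA(emap, ncp = 9, graph = FALSE)

# eigenvalues with the percentage of variance explained by each component
head(rawPCs$eig)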

PCA can be quite useful for higher-order analysis, as PCs can stand in for the raw data vectors in any of the downstream applications where those vectors would have been used, including clustering and network building.
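For instance, here's a minimal sketch of hierarchical clustering on the component scores rather than on the raw profiles (pcs$x holds the scores returned by prcomp; we take the first nine components to match the plots above):

# cluster mutants by their coordinates in PC space
pcDist <- dist(pcs$x[, 1:9])
pcTree <- hclust(pcDist)
plot(pcTree, labels = FALSE)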

Step 4: Markov cluster algorithm

There are MANY different approaches to building network topologies from raw data and correlation matrices. Just a few of the Bayesian approaches are described on the CRAN Task Views page.

In systems biology, the Markov clustering algorithm, or MCL, is very much in vogue because of its speed and generally good performance on biological data. Here, we'll walk through performing MCL and drawing the resulting network, which will also give us a chance to see how easy it is to use external software as part of an analysis project in R.

Currently there are a number of software packages that implement MCL (including one with GPU compute support), but the original was published as a stand-alone command line tool.

From the shell, you can download the source code and compile the mcl command line program very easily:

wget http://micans.org/mcl/src/mcl-12-068.tar.gz
tar xzf mcl-12-068.tar.gz
cd mcl-12-068
./configure --prefix=$HOME/local
make install

By default, compiled binaries will be installed in your ~/local/bin directory, so you can make sure the compilation went well by running this shell command:

~/local/bin/mcl --help

You can run external programs (or execute any shell command) from within an R session using the system function. For example:

system("~/local/bin/mcl")

This makes it very easy to chain calls to external tools into your R data analysis scripts.
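You can also capture a program's standard output back into R by adding intern = TRUE, which is handy when chaining tools (assuming here that mcl accepts a --version flag; substitute whatever invocation your build supports):

# capture the tool's output as a character vector instead of printing it
mclInfo <- system("~/local/bin/mcl --version", intern = TRUE)
mclInfo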

From the main manual page for the MCL command line tool, we can see that this program expects an input file with three unnamed columns: node1, node2 and weight. Let's start by using our Pearson correlation matrix for the weights:

# melt to long format: one row per (mutant, mutant, correlation) triple
write.table( melt(interactionMatrix)
           , "interaction-matrix.txt"
           , col.names  = FALSE
           , row.names  = FALSE
           , quote      = FALSE
           )

We can then run mcl on this file, specifying a new file name to output results:

system("~/local/bin/mcl interaction-matrix.txt --abc -o mcl.results.txt")

Explore the structure of this results file!
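As a hint: in the --abc workflow, each line of the output file is one cluster, with its members separated by tabs, so a minimal sketch for pulling it into R looks like this:

# one cluster per line, members separated by tabs
clusters <- strsplit(readLines("mcl.results.txt"), "\t")

length(clusters)                  # how many clusters did mcl find?
summary(sapply(clusters, length)) # distribution of cluster sizes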

Challenges

Here are some things to try out (either with your own data, or with this EMAP data):