The data

It would be best if you followed along with this exercise using using your project data. The only requirement is that it needs to have at least two quantitative variables. For illustration, I’m going to work with the DeRisi et al. dataset.

Load your data here:

# Enter your code here!

For illustration, I’m going to call the data.frame of log2 transformed, normalized, expression ratios data. You should name your variable something more useful. You should check to make sure your table looks good.

##   derisi.05.5.txt derisi.09.5.txt derisi.11.5.txt derisi.13.5.txt
## 1       0.1647863       0.2714257      0.04684025      0.42867841
## 2       0.1003049       0.2363395     -0.10935876      0.14925937
## 3       0.1492594       0.4688439     -0.01449957      0.14404637
## 4      -1.1909972              NA     -0.96296927     -0.51045706
## 5      -0.7417826       0.2265085     -0.98564471     -0.04841221
## 6      -0.3510744       0.4254593     -0.57346686     -0.52076944
##   derisi.15.5.txt derisi.18.5.txt derisi.20.5.txt
## 1      0.14274017      -0.2108968       0.1826923
## 2      0.11103131      -0.1408255       0.2521734
## 3      0.07176267       0.1150332       0.2141248
## 4     -2.39592868              NA      -0.9378783
## 5     -1.27578631      -0.1942948              NA
## 6      0.40490312       0.3196179       0.2485348

Explore the raw data

As we saw in the previous microarray exercises, it’s always a good idea to start with a quick look at the distribution of raw values in the data set you’re working with. In this lab, I’ll be showing example output plots, but leave it to you to work out an implementation strategy.

We can either use the built in density and plot functons, or the ggplot2 density geom to visualize the overall distribution of value.

For example:

## Use suppressPackageStartupMessages to eliminate package startup messages.

ggplot(gather(data, key = "array", value = "log2.ratio")) +
  geom_density(aes(log2.ratio, color = array))

With your data:

# Enter your code here!

We can clearly see that the data have been preprocessed resulting in a distribution of expression values centered on a log2 of 0. This has implications for how we will call our hierarchical clustering function, because data (usually) need to be centered and (often) scaled before clustering.

Create a hierarchical cluster

A hierarchical cluster is an easy way to get an initial peek at the structure of a large dataset. The base function for creating a hierarchical cluster with an accompanying heatmap of the raw data matrix is heatmap.

Here are a few tricks for tweaking the clusters you make using the builtin heatmap function:

heats <- colorpanel( n    = 20
                   , low  = "royalblue3"
                   , mid  = "black"
                   , high = "yellow"

Make a cluster using heatmap!


m <- as.matrix(data)

#Heat map requires that there be no missing values, so we'll prune rows with NA's
m <- na.omit(m)

# Depending on the size of your data, this might take awhile
heatmap(m, col = heats, scale = "none", labRow="", labCol="",  zlim = c(-4, 4), na.rm = TRUE)

Your data:

# Enter your code here!

What does the structure of the cluster for the DeRisi data demonstrate about changes in gene expression over the diauxic shift time course? What does your cluter show?

Correlation matrix

A second approach to trying to find patterns in a complex data set is to calculate a correlation matrix from the raw data and then apply pattern recognition to the correlation matrix instead of the raw data. In the DeRisi case a correlation matrix could describe how similar the gene expression profiles are between the time points (column correlations) OR how similar genes across the time points (row correlations).

Fortunately, calcuating a correlation matrix is a common enough task that that aptly named cor function does it by default when passed a matrix or data.frame. Check the help file for options for the use argument.

For example:

interactionMatrix <- cor(m, use = 'pairwise.complete.obs')

Try it:

# Enter your code here!

You’ll find it’s pretty hard to visualize the distribution of values in a large matrix like this; but we already know how to solve this problem: a heatmap!

You can use the image function or use the geom_tile geom from ggplot2. ggplot expects all data to be formated in a data.frame (and we have a matrix). You can use tidyr functions to gather your matrix.

For example:

interactionTable            <- reshape2::melt(interactionMatrix)
colnames(interactionTable)  <- c( "array.1"
                                , "array.2"
                                , "value"

ggplot( interactionTable
      , aes(array.1, array.2, fill = value)
      ) + geom_tile() + labs(x = "Arrays", y = "Arrays")

# Enter your code here!

As expected, when the rows and columns are unordered there aren’t many obvious groups (although, you’ll notice that the order of genes in the matrix wasn’t totally random!).

Now experiment!