Using the search features on the GEO website, find the DeRisi 1997 yeast diauxic shift dataset.
If you’re having trouble getting the tables to load on GEO today, here are local links:
(Hint: if the series is not GSE28, you’re on the wrong record)
platform here refers to the microarray design itself. Explore the information that is annotated in the
read.delim won’t be able to parse these lines.
Download microarray data corresponding to the pre-diauxic shift samples (0’ time point) and at least one other time point of your choosing.
The two files you just downloaded hold the results of a microarray image analysis, along with the results of the data pre-processing done for the published paper.
Answer the following questions for yourself about the structure of the data:
rownames when it was loaded?
Use R to load the data in these tables into data.frame structures. Think carefully about how you are naming your variables and keeping track of your work. Unlike the genome annotation lab, I won’t be providing templates for you to follow here, so it’s up to you to partition your code into a logical and organized set of functions!
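As a starting point for the loading step, here is a minimal sketch rather than a template to copy. It assumes the downloaded file is a tab-delimited table preceded by annotation lines that begin with ^, ! or #, and it uses a small temporary file with made-up values in place of a real download. The function name load_array_table, the CH1D_MEAN column name and the sample values are all hypothetical.

```r
# Hypothetical helper: read a GEO-style table, skipping annotation lines.
# Assumption: annotation lines start with '^', '!' or '#'; check your own files.
load_array_table <- function(path) {
  lines <- readLines(path)
  keep <- !grepl("^[!#^]", lines) & nzchar(lines)
  read.delim(text = lines[keep], header = TRUE, stringsAsFactors = FALSE)
}

# Made-up example file standing in for a real download.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "^SAMPLE = example",
  "!Sample_title = made-up values",
  "#ID_REF = spot identifier",
  "ID_REF\tCH1D_MEAN\tCH2D_MEAN",
  "1\t1200\t600",
  "2\t300\t310",
  "3\t50\t400"
), tmp)

t0 <- load_array_table(tmp)
str(t0)
```

Filtering the lines before parsing keeps the parsing call itself simple. Whether your files need this step at all depends on how you downloaded them.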
In this data file the authors have done some of the data pre-processing work for us – namely, normalizing the ratios – but let’s pretend they haven’t!
Using ONLY the data in the
CH2D_MEAN columns, add a column to your data structure that calculates the ratio of the Cy5 (CH1) to Cy3 (CH2) samples for each spot on the array.
(Hint: you may have to convert the numbers read from the file from strings to integers.)
Add another column that converts these plain ratios to the Log2(ratio). Think carefully about why a Log2 transformation of the raw image ratios makes sense before going on to the next section.
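The two column additions could look roughly like the sketch below. It runs on a toy table with made-up values. CH1D_MEAN is an assumed column name, and add_ratio_columns is a hypothetical helper, so adapt both to your own data.

```r
# Toy spot table; CH1D_MEAN is an assumed column name, values are made up.
spots <- data.frame(
  ID_REF    = 1:4,
  CH1D_MEAN = c("1200", "300", "50", "800"),  # deliberately stored as strings
  CH2D_MEAN = c("600", "300", "400", "100"),
  stringsAsFactors = FALSE
)

add_ratio_columns <- function(df, num = "CH1D_MEAN", den = "CH2D_MEAN") {
  # as.numeric() converts strings to numbers; unparseable entries become NA.
  df$ratio <- as.numeric(df[[num]]) / as.numeric(df[[den]])
  df$log2_ratio <- log2(df$ratio)
  df
}

spots <- add_ratio_columns(spots)
spots[, c("ID_REF", "ratio", "log2_ratio")]
```

Compare the ratio and log2_ratio columns for the last two toy spots before moving on.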
At this point, it would make sense for us to look at the distribution of the ratio values on our array. One common problem with competitive hybridization microarrays (two-color arrays) is that the sample labeled with one of the dyes has often been loaded at a slightly higher concentration than the other. This is usually due to subtle pipetting or cDNA concentration calculation errors. Examining a distribution of ratio values on our array should help us identify any bias towards the green or red end of the color spectrum.
Functions which might prove useful to you are the ggplot2
geom_boxplot functions. Use R to produce at least two different plots that allow us to visualize the distribution of ratio values.
For this exercise, will you want to be working with the raw ratio values or the Log2 transformed values? Why?
Save these plots in thoughtfully named files. Do you see a ratio bias in these data?
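One possible shape for the plotting and saving step is sketched here on simulated values, not the real array data. The +0.3 shift is invented to mimic a dye bias, the file names are only suggestions, and the ggplot2 part is skipped if the package is not installed.

```r
set.seed(1)
# Simulated log2 ratios; the +0.3 shift is a made-up stand-in for a dye bias.
spots <- data.frame(log2_ratio = rnorm(500, mean = 0.3, sd = 0.8))

# A numeric summary is a useful companion to the plots.
center <- median(spots$log2_ratio)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_hist <- ggplot(spots, aes(x = log2_ratio)) +
    geom_histogram(bins = 40) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    labs(x = "log2(Cy5 / Cy3)", y = "Number of spots")
  p_box <- ggplot(spots, aes(x = "array", y = log2_ratio)) +
    geom_boxplot() +
    labs(x = NULL, y = "log2(Cy5 / Cy3)")
  # Written to the session's temporary directory; pick your own location.
  ggsave(file.path(tempdir(), "log2_ratio_histogram.pdf"), p_hist,
         width = 5, height = 4)
  ggsave(file.path(tempdir(), "log2_ratio_boxplot.pdf"), p_box,
         width = 3, height = 4)
}
```

The dashed line at zero marks where an unbiased distribution would be centered, which makes any shift easier to see.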
How do the ratios compare if you use
_MEDIAN pixel intensities instead of
To help us better understand the distributions we see above, it might be useful to consider a couple of additional strategies for visualizing these data.
First, it would be interesting to examine a scatter plot comparing the red channel pixel intensity to the green channel pixel intensity for each individual spot on the array. Produce this scatter plot (in ggplot2
geom_point). Make sure your plot includes clearly labeled axes (
What can we learn about the overall distribution of ratios on the array from this plot? On average are the changes large or small? Are many genes affected or only a small number?
Second, we should also examine whether or not the total intensity of a spot is related to the ratio bias observed above. Make a scatter plot aimed at addressing this question (compare the color ratio of each spot to its total intensity). You’ll have to decide how you want to calculate the total intensity for each spot.
Was there an intensity bias? Why might this be a concern to us? What might this tell us about the behavior of the dyes used to label the two samples on this array?
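As one way to set up the ratio versus intensity comparison, the sketch below uses the average log2 intensity of the two channels as the "total intensity". That definition is an assumption, and other choices (such as the sum or the product of the channels) are equally defensible. The intensities are simulated and the column names are hypothetical.

```r
set.seed(2)
# Simulated channel intensities (made-up values, not from the real arrays).
spots <- data.frame(
  red   = round(rlnorm(300, meanlog = 6, sdlog = 1)) + 1,
  green = round(rlnorm(300, meanlog = 6, sdlog = 1)) + 1
)

# One possible definition: log2 ratio (M) against average log2 intensity (A).
spots$M <- log2(spots$red / spots$green)
spots$A <- 0.5 * (log2(spots$red) + log2(spots$green))

# A rank correlation far from 0 would hint at an intensity-dependent bias.
rho <- cor(spots$A, spots$M, method = "spearman")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_ma <- ggplot(spots, aes(x = A, y = M)) +
    geom_point(alpha = 0.4) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Mean log2 intensity (A)", y = "log2(red / green) (M)")
}
```

Because the simulated channels are independent, this toy plot should show no trend. The interesting question is whether your real data do.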
So far we’ve been ignoring background pixel intensities. These values are a measure of the brightness in the regions surrounding each spot (the “background”). Background is obviously noise and doesn’t correspond to cDNAs binding specifically to probes on the array.
Produce a visualization to help us assess how variable the background intensities are across the surface of the array. One great way to do this would be to produce a heat map, where each box corresponds to a spot position on our array and the color corresponds to the intensity of the background at that spot.
This can be done with the
heat.colors functions. Explore the way these functions work using some sample data. Think about the structure of the data you want to pass into
image and how you can make a matrix conforming to this architecture using your existing data.frames.
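To explore how image and heat.colors behave on sample data, a sketch like the following may help. It assumes each spot has a known row and column position on the array. The spot_row, spot_col and background columns are hypothetical, and the values are simulated with an artificial gradient.

```r
set.seed(3)
# Toy background table; position columns and values are made up.
grid <- expand.grid(spot_row = 1:8, spot_col = 1:12)
grid$background <- 100 + 5 * grid$spot_col + rnorm(nrow(grid), sd = 3)

# image() expects a numeric matrix z; fill it by (row, column) index.
z <- matrix(NA_real_, nrow = max(grid$spot_row), ncol = max(grid$spot_col))
z[cbind(grid$spot_row, grid$spot_col)] <- grid$background

pdf(NULL)  # draw to a null device; remove this line to plot interactively
image(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z,
      col = heat.colors(20), xlab = "Spot row", ylab = "Spot column")
invisible(dev.off())
```

Note that image draws the rows of z along the x axis, so the picture is rotated relative to how the matrix prints in the console.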
Are the background intensities uniform or are some much higher than others? Where are the problems the greatest?
How should we use background intensity information?
Knowing what you know now about your data, design a way to add a normalized ratio column to your data structure. Implement it, run it, and check to make sure it worked!
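One common design, offered here as an assumption and not necessarily the method used for the published paper, is to subtract background from each channel and then shift the log2 ratios so that their median is zero. The sketch below uses made-up values and hypothetical column names. The floor_value parameter is likewise an illustrative choice.

```r
# Made-up values; all column names here are hypothetical.
spots <- data.frame(
  red_signal   = c(1300, 420, 150, 900, 2100),
  red_bg       = c(100, 120, 100, 100, 100),
  green_signal = c(500, 220, 250, 300, 600),
  green_bg     = c(100, 120, 100, 100, 100)
)

normalize_log2_ratio <- function(df, floor_value = 1) {
  # Background-subtract each channel; floor at a small positive value
  # so that log2() never sees zero or negative numbers.
  red   <- pmax(df$red_signal - df$red_bg, floor_value)
  green <- pmax(df$green_signal - df$green_bg, floor_value)
  raw <- log2(red / green)
  # Median-centre: assumes most spots do not change between the two samples.
  df$norm_log2_ratio <- raw - median(raw, na.rm = TRUE)
  df
}

spots <- normalize_log2_ratio(spots)
median(spots$norm_log2_ratio)
```

Checking that the median of the new column is zero is a quick way to confirm the centring worked. Whether median centring is appropriate at all depends on what you saw in the intensity plots.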
Up to this point we’ve been looking at our data one microarray table at a time. As we start exploring the data it would be more convenient to have a new table which contains just the set of final normalized ratio values for each array. Let’s list arrays across the columns and genes down the rows.
You also may have noticed that the array data files we downloaded from GEO don’t tell us which row is associated with which yeast gene – instead they give us an ID_REF column. The IDs in this column match up with the IDs in the platform table (see step #1). Since we’re building a new table anyway, it makes sense to pull in the gene annotations at this point as well from the
Implement a function to do it! Your function will probably need to take multiple source array data tables, know which column to pull data from and take a platform table.
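A function along these lines might be organized as in the following sketch. The inputs are made up: the platform columns ID and ORF, the value column norm_log2_ratio, and the pairing of IDs with ORF names are all hypothetical, so substitute the names found in your own tables.

```r
# Made-up inputs; column names and the ID-to-ORF pairing are hypothetical.
platform <- data.frame(ID = 1:3, ORF = c("YAL001C", "YAL002W", "YAL003W"),
                       stringsAsFactors = FALSE)
arrays <- list(
  t0 = data.frame(ID_REF = 1:3, norm_log2_ratio = c(0.1, -0.2, 0.0)),
  t5 = data.frame(ID_REF = c(3L, 1L, 2L), norm_log2_ratio = c(1.4, -1.1, 0.3))
)

combine_arrays <- function(arrays, platform, value_col = "norm_log2_ratio") {
  # Keep only ID_REF plus the chosen column, renamed after the array.
  slim <- lapply(names(arrays), function(nm) {
    out <- arrays[[nm]][, c("ID_REF", value_col)]
    names(out)[2] <- nm
    out
  })
  # Join arrays on ID_REF; spots missing from some arrays become NA.
  merged <- Reduce(function(a, b) merge(a, b, by = "ID_REF", all = TRUE), slim)
  # Attach the annotation by matching ID_REF to the platform ID.
  merge(platform, merged, by.x = "ID", by.y = "ID_REF", all.y = TRUE)
}

expr <- combine_arrays(arrays, platform)
expr
```

Joining on the ID, instead of binding columns side by side, means the result stays correct even when the arrays list their spots in different orders, as the second toy array does.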
Once you have a data.frame with all of the ratio data from each microarray loaded and annotated, it’s time to start exploring the data set.
Here are some challenges to try:
Produce a list of genes which showed expression level changes > 2x.
Produce a hierarchical cluster. You’ll want to explore the
heatmap function. Note, the
heatmap function wants you to pass in data as a matrix (rather than a data.frame). You can convert your data to a matrix with the
as.matrix function. Think carefully about what you want to do about missing data…
Experiment with clustering only the rows (genes) or only the columns (time points). You can also perform clustering alone using the
hclust functions (explore the output of these functions).
cutree function to “trim” your gene dendrogram into a set of significant groupings using k-means clustering (see also,
RowSideColors argument to the heatmap function to highlight genes that share a common GO annotation (the easiest way to pull in GO annotations is from the “slim” set at SGD:
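A couple of the challenges above can be sketched on simulated data as follows. The matrix of log2 ratios, the gene names, and the choice of k = 2 are all made up for illustration. With real data you would start from your annotated table converted with as.matrix.

```r
set.seed(4)
# Simulated log2 ratios: 30 genes (rows) by 4 time points (columns).
expr <- matrix(rnorm(120, sd = 0.3), nrow = 30,
               dimnames = list(paste0("gene", 1:30), paste0("t", 1:4)))
expr[1:5, 3:4]  <- expr[1:5, 3:4] + 2    # made-up induced group
expr[6:10, 3:4] <- expr[6:10, 3:4] - 2   # made-up repressed group

# A more than 2-fold change in either direction means |log2 ratio| > 1.
changed <- rownames(expr)[apply(abs(expr) > 1, 1, any, na.rm = TRUE)]

# Cluster the changed genes using the dist() and hclust() defaults.
sub <- expr[changed, , drop = FALSE]
tree <- hclust(dist(sub))
groups <- cutree(tree, k = 2)  # k = 2 is an arbitrary choice for the toy data
table(groups)
```

Here cutree cuts the hierarchical tree from hclust into k groups. k-means is a different algorithm, available in base R as kmeans, and comparing the two sets of groupings is a worthwhile exercise.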