# Step 1: The source data

There are two major public repositories of microarray datasets: GEO (NCBI) and ArrayExpress (EMBL). Today we’ll be using data published on GEO.

Using the search features on the GEO website, find the DeRisi 1997 yeast diauxic shift dataset. (Hint: if the series is not GSE28, you’re on the wrong record.)

• Take a moment to look at how information about microarray experiments is annotated in this database. What information is recorded in this database? How is it organized?
• Examine the Platform record. The term platform here refers to the microarray design itself. Explore the information that is annotated in the platform record.
• Explore the structure of the Sample records. You can use the “View full table” button to get a view of all of the data. What is the structure of these records? What kinds of data does each contain?

After fetching this exercise (bio297::fetch()), you’ll have copies of the sample and platform files in your “data/” folder.

# Step 2: Look at the format of the table

Looking over the raw text files, consider the following questions for yourself about the structure of the data:

• There are two sections to these text files, a header and a data table. What information is held in each?
• What do the rows of the data table represent?
• What information about each row does each of the columns hold? Is there anything odd here?
• What is the text format of this table? How are rows separated? How are elements within each row separated?
• What is in the ID_REF column? Did you pick up the data in this column for your data.frame rownames when it was loaded? Why might this be good to do?

Use R to load the data in these tables into data.frame structures. Think carefully about how you are naming your variables and keeping track of your work. Unlike the genome annotation lab, I won’t be providing templates for you to follow here, so it’s up to you to partition your code into a logical and organized set of functions!
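If you are unsure where to start, here is a minimal sketch of one possible loader. It assumes that header lines begin with `#`, `!` or `^` and that the table itself is tab-delimited, so check both assumptions against your own files. The function name `read_geo_table` and the toy file contents are made up for illustration and are not part of the course package.

```r
# Hypothetical helper: read a GEO-style text file whose header lines start
# with '#', '!' or '^', followed by a tab-delimited data table.
read_geo_table <- function(path) {
  lines <- readLines(path)
  # Keep non-empty lines that do not look like header or marker lines
  # (this pattern is an assumption about the file format).
  table_lines <- lines[!grepl("^[#!^]", lines) & nzchar(lines)]
  # Use the ID_REF column as row names; this fails if IDs are duplicated.
  read.delim(text = table_lines, header = TRUE, row.names = "ID_REF",
             stringsAsFactors = FALSE)
}

# Toy file standing in for one of the files in "data/".
demo_path <- tempfile(fileext = ".txt")
writeLines(c(
  "#ID_REF = spot identifier (toy description)",
  "#CH1D_MEAN = channel 1 signal (toy description)",
  "ID_REF\tCH1D_MEAN\tCH2D_MEAN",
  "1\t7075\t7613",
  "2\t7768\t7996",
  "3\t4575\t"
), demo_path)

demo <- read_geo_table(demo_path)
str(demo)
```

The empty last field in the third row is read as NA, which is the kind of thing worth checking for in the real files.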

# Enter your code here!


I’m going to name the data frame for the first timepoint ds.0. Look over the data.frame you loaded carefully, cleaning it up where needed. Remember to always double check your data types.

head(ds.0)
##   CH1B_MEAN CH1B_MEDIAN CH1D_MEAN CH2I_MEAN CH2D_MEAN CH2B_MEAN
## 1      2720        2608      7075      8925      7613      1424
## 2      2687        2600      7768      9300      7996      1381
## 3      2630        2584      4575      6168      4872      1327
## 4      2629        2568        64      1315        27      1340
## 5      2639        2568       662      1668       380      1345
## 6      2628        2576       481      1650       362      1334
##   CH2B_MEDIAN CH2BN_MEDIAN CH2DN_MEAN CH2IN_MEAN  CORR FLAG  VALUE
## 1        1312         1367       7930       9297 0.000    0  0.165
## 2        1304         1358       8330       9688 0.000    0  0.101
## 3        1296         1350       5075       6425 0.000    0  0.150
## 4        1288         1342         28       1370 0.151    0 -1.193
## 5        1288         1342        396       1738 0.927    0 -0.741
## 6        1288         1342        377       1719 0.914    0 -0.351
##   PIX_RAT2_MEDIAN PERGTBCH1I_1SD PERGTBCH2I_1SD RAT1_MEAN RAT1N_MEAN
## 1           1.090             27             36     0.929      0.892
## 2           0.986             31             36     0.971      0.933
## 3           1.047             27             31     0.939      0.901
## 4           1.000              0              0     2.370      2.286
## 5           0.605             17             18     1.742      1.672
## 6           0.818              7             11     1.329      1.276
##   RAT2_MEAN RAT2N_MEAN  REGR TOT_BPIX TOT_SPIX TOP BOT LEFT RIGHT
## 1     1.076      1.121 0.000     1254       80  21  31   50    60
## 2     1.029      1.072 0.000     1074       80  21  31   61    71
## 3     1.065      1.109 0.000     1030       80  21  31   72    82
## 4     0.422      0.438 0.144     1030       80  21  31   83    93
## 5     0.574      0.598 0.598     1030       80  21  31   94   104
## 6     0.753      0.784 0.637     1030       80  21  31  105   115
##   UNF_VALUE
## 1     0.165
## 2     0.101
## 3     0.150
## 4    -1.193
## 5    -0.741
## 6    -0.351
summary(ds.0)
##    CH1B_MEAN     CH1B_MEDIAN     CH1D_MEAN       CH2I_MEAN
##  Min.   :1580   Min.   :1560   Min.   :    1   Min.   :  803
##  1st Qu.:2088   1st Qu.:2064   1st Qu.: 1406   1st Qu.: 2430
##  Median :2385   Median :2328   Median : 2325   Median : 3456
##  Mean   :2434   Mean   :2317   Mean   : 3925   Mean   : 5184
##  3rd Qu.:2713   3rd Qu.:2568   3rd Qu.: 4266   3rd Qu.: 5557
##  Max.   :7582   Max.   :3304   Max.   :51968   Max.   :55149
##                                NA's   :47
##    CH2D_MEAN       CH2B_MEAN     CH2B_MEDIAN    CH2BN_MEDIAN
##  Min.   :    3   Min.   : 826   Min.   : 816   Min.   : 850
##  1st Qu.: 1240   1st Qu.:1144   1st Qu.:1112   1st Qu.:1158
##  Median : 2250   Median :1313   Median :1248   Median :1300
##  Mean   : 3986   Mean   :1350   Mean   :1220   Mean   :1271
##  3rd Qu.: 4324   3rd Qu.:1467   3rd Qu.:1328   3rd Qu.:1383
##  Max.   :53893   Max.   :6512   Max.   :1976   Max.   :2058
##  NA's   :35
##    CH2DN_MEAN      CH2IN_MEAN         CORR              FLAG
##  Min.   :    3   Min.   :  836   Min.   :-1.9870   Min.   :-100.0000
##  1st Qu.: 1292   1st Qu.: 2531   1st Qu.: 0.0000   1st Qu.:   0.0000
##  Median : 2344   Median : 3600   Median : 0.9280   Median :   0.0000
##  Mean   : 4152   Mean   : 5400   Mean   : 0.6623   Mean   :  -0.3594
##  3rd Qu.: 4505   3rd Qu.: 5789   3rd Qu.: 0.9610   3rd Qu.:   0.0000
##  Max.   :56139   Max.   :57447   Max.   : 3.1840   Max.   :   0.0000
##  NA's   :35
##      VALUE          PIX_RAT2_MEDIAN  PERGTBCH1I_1SD   PERGTBCH2I_1SD
##  Min.   :-2.88300   Min.   :0.2140   Min.   :  0.00   Min.   :  0.00
##  1st Qu.:-0.16800   1st Qu.:0.8510   1st Qu.: 35.00   1st Qu.: 37.00
##  Median :-0.00100   Median :0.9570   Median : 48.00   Median : 53.00
##  Mean   :-0.03209   Mean   :0.9498   Mean   : 51.17   Mean   : 55.48
##  3rd Qu.: 0.14600   3rd Qu.:1.0500   3rd Qu.: 68.00   3rd Qu.: 75.00
##  Max.   : 4.85800   Max.   :2.0000   Max.   :100.00   Max.   :100.00
##  NA's   :71
##    RAT1_MEAN       RAT1N_MEAN      RAT2_MEAN         RAT2N_MEAN
##  Min.   :0.036   Min.   :0.034   Min.   : 0.1280   Min.   : 0.136
##  1st Qu.:0.941   1st Qu.:0.903   1st Qu.: 0.8550   1st Qu.: 0.890
##  Median :1.042   Median :1.000   Median : 0.9600   Median : 1.000
##  Mean   :1.091   Mean   :1.047   Mean   : 0.9651   Mean   : 1.005
##  3rd Qu.:1.170   3rd Qu.:1.123   3rd Qu.: 1.0630   3rd Qu.: 1.107
##  Max.   :7.800   Max.   :7.378   Max.   :28.0000   Max.   :29.000
##  NA's   :50      NA's   :50      NA's   :50        NA's   :50
##       REGR            TOT_BPIX         TOT_SPIX       TOP
##  Min.   :-1.6740   Min.   : 595.0   Min.   :80   Min.   : 21.0
##  1st Qu.: 0.0000   1st Qu.: 661.0   1st Qu.:80   1st Qu.:241.2
##  Median : 0.7855   Median : 669.0   Median :80   Median :463.0
##  Mean   : 0.6152   Mean   : 696.3   Mean   :80   Mean   :462.9
##  3rd Qu.: 0.9310   3rd Qu.: 702.0   3rd Qu.:80   3rd Qu.:685.2
##  Max.   : 4.8780   Max.   :1295.0   Max.   :80   Max.   :906.0
##
##       BOT             LEFT           RIGHT         UNF_VALUE
##  Min.   : 31.0   Min.   : 49.0   Min.   : 59.0   Min.   :-2.8830
##  1st Qu.:251.2   1st Qu.:269.2   1st Qu.:279.2   1st Qu.:-0.1670
##  Median :473.0   Median :491.5   Median :501.5   Median : 0.0000
##  Mean   :472.9   Mean   :492.2   Mean   :502.2   Mean   :-0.0310
##  3rd Qu.:695.2   3rd Qu.:716.0   3rd Qu.:726.0   3rd Qu.: 0.1467
##  Max.   :916.0   Max.   :936.0   Max.   :946.0   Max.   : 4.8580
##                                                  NA's   :50
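The summary above reports NA counts for several columns. One way to double-check data types and missing values is sketched below on a toy data frame (the values, including the unparseable entry, are invented for illustration).

```r
# Toy data frame standing in for ds.0; one column was read as text by mistake.
toy <- data.frame(
  CH1D_MEAN = c("7075", "7768", "not_a_number"),
  CH2D_MEAN = c(7613, NA, 4872),
  stringsAsFactors = FALSE
)

# Class of every column: anything that is not numeric deserves a closer look.
col_classes <- sapply(toy, class)
print(col_classes)

# Convert the text column to numeric. Values that cannot be parsed become NA;
# as.numeric() warns about this, and the warning is silenced here on purpose.
toy$CH1D_MEAN <- suppressWarnings(as.numeric(toy$CH1D_MEAN))

# Count missing values per column, similar to the NA's rows in summary().
na_counts <- colSums(is.na(toy))
print(na_counts)
```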

# Step 3: Calculating a ratio

In this data file the authors have done some of the data pre-processing work for us – namely, normalizing the ratios – but let’s pretend they haven’t!

Using ONLY the data in the CH1D_MEAN and CH2D_MEAN columns, add a column to your data structure that calculates the ratio of the two channels. The convention is to express ratios as ( experiment / control ). Figure out which channel should be considered experimental and which the control given the design of this experiment (hint: we want to answer the question “How does gene expression change over the diauxic shift timecourse?”).
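The mechanics of adding the column can be sketched as follows. The helper name `add_ratio` is hypothetical, and the channel assignment in the example call is an arbitrary placeholder: deciding which channel is the experiment is still up to you.

```r
# Hypothetical helper: add an experiment / control ratio column to a data frame.
add_ratio <- function(df, experiment_col, control_col, new_col = "ratio") {
  df[[new_col]] <- df[[experiment_col]] / df[[control_col]]
  df
}

# Toy rows with the same column names as the real table.
toy <- data.frame(CH1D_MEAN = c(7075, 7768, 64),
                  CH2D_MEAN = c(7613, 7996, 27))

# Placeholder assignment only; swap the arguments if your reading of the
# experimental design says the channels are the other way around.
toy <- add_ratio(toy, experiment_col = "CH2D_MEAN", control_col = "CH1D_MEAN")
print(toy)
```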

Now, plot the distribution of these values. Functions which might prove useful to you throughout this exercise are the ggplot2 geom_density, geom_histogram, or geom_boxplot functions. Use R to produce at least two different plots that allow us to visualize the distribution of ratio values.
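As a rough sketch of what these plots might look like in code, the example below draws a histogram, a density curve, and a boxplot from simulated ratios. The simulated data and the object names are assumptions for illustration; substitute your own data frame and ratio column.

```r
library(ggplot2)

# Simulated ratios standing in for the real ratio column.
set.seed(1)
toy <- data.frame(ratio = rlnorm(500, meanlog = 0, sdlog = 0.5))

# Histogram of the ratio values.
p_hist <- ggplot(toy, aes(x = ratio)) +
  geom_histogram(bins = 40) +
  labs(x = "Ratio (experiment / control)", y = "Number of spots")

# Smoothed density of the same values.
p_dens <- ggplot(toy, aes(x = ratio)) +
  geom_density() +
  labs(x = "Ratio (experiment / control)", y = "Density")

# Boxplot; the empty x value gives a single box for all spots.
p_box <- ggplot(toy, aes(x = "", y = ratio)) +
  geom_boxplot() +
  labs(x = NULL, y = "Ratio (experiment / control)")

print(p_hist)
print(p_dens)
print(p_box)
```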

What’s the problem with this distribution? Why does it have the shape it does? Add another column that converts these plain ratios to log2(ratio). The convention in gene expression bioinformatics is to call log2( experiment / control ) values “M-values.”
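A small worked example, using made-up ratios, shows what the log2 transformation does to fold changes in either direction:

```r
# Made-up ratios: 4-fold and 2-fold down, unchanged, 2-fold and 4-fold up.
ratios <- c(0.25, 0.5, 1, 2, 4)

# M-values are the log2 of the ratio.
m_values <- log2(ratios)
print(m_values)
# -2 -1  0  1  2

# Adding the column to a data frame works the same way.
toy <- data.frame(ratio = ratios)
toy$m_value <- log2(toy$ratio)
```

On the raw scale all decreases are squeezed between 0 and 1 while increases run from 1 upward; on the log2 scale a 2-fold decrease and a 2-fold increase sit at the same distance from 0.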

# Step 4: Distribution of ratio values

At this point, it would make sense for us to look at the distribution of the ratio values on our array. One common problem with competitive hybridization microarrays (two-color arrays) is that the sample labeled with one of the dyes has been loaded at a slightly higher concentration than the other. This is usually due to subtle errors in pipetting or in calculating cDNA concentrations. Examining the distribution of ratio values on our array should help us identify any bias towards the green or red end of the color spectrum.

For this exercise, will you want to work with the raw ratio values or the log2-transformed values? Why? Do you see a ratio bias in these data?
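One quick numeric check for a bias is the center of the M-value distribution. If the two samples were loaded equally, and if most genes do not change (an assumption, not something the data guarantee), the median M-value should sit near 0. The sketch below uses simulated values with a deliberate shift rather than the real array.

```r
# Simulated M-values with a deliberate shift of 0.2 to mimic a loading bias.
set.seed(42)
m_values <- rnorm(1000, mean = 0.2, sd = 0.5)

# The median is less sensitive to a few extreme spots than the mean.
center <- median(m_values, na.rm = TRUE)
print(center)

# Express the shift as a fold difference between the two channels.
fold_bias <- 2^center
print(fold_bias)
```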

How do the ratios compare if you use _MEDIAN pixel intensities instead of _MEAN intensities?

# Step 5: Distribution of ratio values, Part 2

To help us better understand the distributions we see above, it might be useful to consider a couple of additional strategies for visualizing these data.

First, it would be interesting to examine a scatter plot comparing the red channel pixel intensity to the green channel pixel intensity for each individual spot on the array. Produce this scatter plot (in ggplot2, use geom_point). Make sure your plot includes clearly labeled axes (labs, xlab, and ylab).
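A possible shape for this plot is sketched below with simulated intensities. The log-scaled axes and the dashed y = x reference line are design choices of this sketch, not requirements, and the axis labels deliberately say "channel 1" and "channel 2" because working out which channel is red and which is green is left to you.

```r
library(ggplot2)

# Simulated spot intensities standing in for the two signal columns.
set.seed(7)
ch1 <- rlnorm(300, meanlog = 8, sdlog = 1)
toy <- data.frame(CH1D_MEAN = ch1,
                  CH2D_MEAN = ch1 * rlnorm(300, meanlog = 0, sdlog = 0.3))

p_scatter <- ggplot(toy, aes(x = CH1D_MEAN, y = CH2D_MEAN)) +
  geom_point(alpha = 0.4) +
  # On log-log axes a slope of 1 and intercept of 0 is the line y = x,
  # i.e. spots with equal intensity in both channels.
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Channel 1 intensity (CH1D_MEAN)",
       y = "Channel 2 intensity (CH2D_MEAN)",
       title = "Per-spot intensity, channel 2 vs. channel 1")

print(p_scatter)
```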

What can we learn about the overall distribution of ratios on the array from this plot? On average are the changes large or small? Are many genes affected or only a small number?

Second, we should also examine whether or not the total intensity of a spot is related to the ratio bias observed above. Make a scatter plot aimed at addressing this question (compare the color ratio of each spot to its total intensity). You’ll have to decide how you want to calculate the total intensity for each spot.
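One common choice, though not the only one, is the "MA plot": M is the log2 ratio and A is the average of the two log2 intensities. The sketch below computes both for simulated intensities; the helper name `ma_values` and the channel assignment are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical helper: M = log2(experiment / control),
# A = average log2 intensity of the two channels.
ma_values <- function(experiment, control) {
  data.frame(M = log2(experiment / control),
             A = 0.5 * (log2(experiment) + log2(control)))
}

# Quick sanity check with easy numbers: 2048 vs 512 is a 4-fold difference.
check <- ma_values(experiment = 2048, control = 512)
print(check)  # M = 2, A = 10

# Simulated intensities standing in for the two signal columns.
set.seed(11)
control <- rlnorm(300, meanlog = 8, sdlog = 1)
experiment <- control * rlnorm(300, meanlog = 0, sdlog = 0.3)
ma <- ma_values(experiment, control)

p_ma <- ggplot(ma, aes(x = A, y = M)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +  # no change between channels
  labs(x = "A: mean log2 intensity", y = "M: log2(experiment / control)")

print(p_ma)
```

If the cloud of points drifts away from the dashed line at low or high A, the ratio depends on intensity.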

Was there an intensity bias? Why might this be a concern to us? What might this tell us about the behavior of the dyes used to label the two samples on this array?