Although you’ve made a good start at writing your own genome annotation program, as you might expect, a number of useful genome annotation resources already exist for the yeast genome. Have a look at the data sets available on the yeast genome website under Curated Data.
Files ending in “.tab” are tab (“\t”) delimited and can be loaded in R with the read.delim function.
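As a minimal sketch of loading a tab-delimited file, the block below writes a tiny made-up table to a temporary file and reads it back with read.delim. The column names and gene labels are hypothetical stand-ins, not the actual layout of any SGD file.

```r
# Hypothetical mini-table standing in for a real ".tab" download
tab_lines <- c(
  "gene\tchromosome\tlength",
  "GENE_A\t1\t1200",
  "GENE_B\t2\t864",
  "GENE_C\t2\t2310"
)
tab_path <- tempfile(fileext = ".tab")
writeLines(tab_lines, tab_path)

# read.delim assumes tab-separated fields and a header row by default
features <- read.delim(tab_path, stringsAsFactors = FALSE)
str(features)
```

If a downloaded file turns out to have no header row, passing header = FALSE (and then assigning column names yourself) is the usual adjustment.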
Here are two data sets of particular note:
SGD_features.tab: A more comprehensive version of the feature annotation table you’ve been working on building.
go_slim_mapping.tab: A curated set of “Gene Ontology” (GO) terms describing the process, function, and localization of each gene product encoded in the yeast genome. The “slim” data set maps each protein/functional RNA to a single “term”; the full GO mapping data set allows a single protein/RNA to be mapped to multiple terms.
Your goal for the rest of lab today is to ask an interesting question about the yeast genome that can be answered by analyzing the data in one or more of the curated data sets on SGD. By the end of lab, try to have at least one data visualization that answers that question!
In the section below I’ll introduce a very powerful graphics package for R that you will probably want to use to explore different visualization strategies.
Although the base graphics capabilities provided by R are extremely flexible and powerful, the default visual parameters and programming interface are pretty ugly and outdated. Hadley Wickham’s
ggplot2 package, inspired by Leland Wilkinson’s call for a grammar of graphics, is a far easier platform to work with and a better choice for newcomers to R.
So today, we’ll focus on introducing plotting in R using the
ggplot2 package instead of the base
plot(...) function. Hopefully you’ll be excited by how easy it is to create some stunning plots!
If it isn’t already available, you can install the
ggplot2 package by running
install.packages("ggplot2"). Once the package is installed, it must be loaded in each script or console session before use.
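A minimal sketch of the install-then-load step, assuming a CRAN mirror is reachable. The requireNamespace check is an addition of this sketch that skips the install when the package is already present.

```r
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so ggplot(), aes(), and the geom_ functions can be called directly
library(ggplot2)
```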
Here, I’ll demonstrate some example uses of
ggplot2, but for many more ideas of what you can do with the library see the full documentation: docs.ggplot2.org.
For these examples we’ll use the built-in
diamonds dataset. See
?diamonds for detailed information about this sample data.
The general workflow for constructing plots using
ggplot2 is to first describe the structure of the data that you want to visualize and then apply visualization functions to that structure. This coding idiom, and the syntax designed to support it, will probably feel strange at first, especially if you are coming from plotting/graphing platforms that conflate these two concerns. You will soon come to appreciate the elegance of separating your description of what data you want to visualize from the particulars of how you want to draw it. As you walk through the following examples, hopefully the power and flexibility of this design pattern will become clear!
The first step in producing a
ggplot2 visualization is to initialize a data structure that describes (1) the source of the data we want to work with and (2) the relationship between the variables that we want to visualize (this is called the “aesthetic mapping”):
# Here's the structure of the diamonds dataset
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

# Construct a ggplot object to visualize the relationship between clarity and depth
p <- ggplot(diamonds, aes(clarity, depth))
The aes function in the above block sets up the structure of our visualization: we’re interested in considering depth (quantitative) as a function of clarity (qualitative). The
p object doesn’t yet hold any information about how this relationship should be visualized, only the structure of what is being visualized.
You can use the
+ operator to add visualization functions to this underlying structure to actually produce plots. The data projection functions all start with
geom_. For example, if we want points on our plot we use
p + geom_point()
Switching to a different data projection, like a boxplot, is as simple as “adding” a different
geom_ to our original data structure:
p + geom_boxplot()
As you can see, keeping with the spirit of the R environment,
ggplot2 is designed to make it easy to rapidly iterate over different visualization and analysis approaches as you work interactively in the console. Simply take your underlying
ggplot2 object describing data relationships and apply different visualization techniques to find one that suits your needs!
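The iterate-over-geoms workflow described above can be sketched as follows. The object names (p_points, p_boxplot, p_violin, p_layered) and the specific jitter and transparency settings are choices made for this sketch, not part of the original examples.

```r
library(ggplot2)

# Describe the data relationship once...
p <- ggplot(diamonds, aes(clarity, depth))

# ...then try several geoms against the same structure
p_points  <- p + geom_point()
p_boxplot <- p + geom_boxplot()
p_violin  <- p + geom_violin()

# Layers can also be stacked: faint jittered points under an unfilled boxplot
p_layered <- p +
  geom_jitter(alpha = 0.05, width = 0.2) +
  geom_boxplot(outlier.shape = NA, fill = NA)

# At the console, typing an object's name draws it; in a script, call print()
print(p_layered)
```

Because p itself is never modified, each variant can be compared side by side without rebuilding the aesthetic mapping.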
Above we saw two examples of how to visualize a quantitative variable grouped by a categorical variable. When you begin exploring a new dataset, it is often useful to start with a focused look at the distribution of each quantitative variable. Here are some examples:
# Visualize the distribution of the depth variable
p2 <- ggplot(diamonds, aes(depth))

# Histogram
p2 + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
# Density distribution
p2 + geom_density()
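The binwidth message printed for the histogram above suggests choosing the bin width yourself. A minimal sketch follows; the values 0.5 and 20 are arbitrary illustrative choices, and the names hist_fine and hist_coarse are made up for this example.

```r
library(ggplot2)

p2 <- ggplot(diamonds, aes(depth))

# Choose the bin width explicitly, in units of the depth variable
hist_fine <- p2 + geom_histogram(binwidth = 0.5)

# Alternatively, fix the number of bins instead of their width
hist_coarse <- p2 + geom_histogram(bins = 20)

print(hist_fine)
```

Trying a few widths is worthwhile, since a histogram's apparent shape can change noticeably with the binning.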
There are other kinds of functions that can be applied to ggplot objects to tweak the visualization. For example, to set the axis limits on the above plot we can apply the xlim function:
# Set x-axis limits
p2 + geom_density() + xlim(c(55, 65))
## Warning: Removed 857 rows containing non-finite values (stat_density).
When we run this code, we get a warning telling us that these limits have caused some data to be excluded from the plot. There are a number of useful sanity checks built into
ggplot2 that warn you if your visualization parameters risk producing misleading results!
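To see where the warning comes from, the sketch below counts the rows that fall outside the limits and contrasts xlim with coord_cartesian, a ggplot2 function that zooms the view without discarding data. The names n_outside, clipped, and zoomed are made up for this example.

```r
library(ggplot2)

p2 <- ggplot(diamonds, aes(depth))

# How many rows lie outside the limits used above?
n_outside <- sum(diamonds$depth < 55 | diamonds$depth > 65)
n_outside

# xlim() drops those rows before the density is estimated (hence the warning)
clipped <- p2 + geom_density() + xlim(55, 65)

# coord_cartesian() only zooms the view: all rows still contribute to the estimate
zoomed <- p2 + geom_density() + coord_cartesian(xlim = c(55, 65))

print(zoomed)
```

Which behavior is appropriate depends on the question: dropping outliers changes the estimated curve, while zooming leaves it intact.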
Of course, we can also visualize the relationship between two quantitative variables using a scatter plot:
# Visualize depth as a function of price
p3 <- ggplot(diamonds, aes(price, depth))

# Use points (scatter plot)
p3 + geom_point()
In your aesthetic mappings, you can describe more than just the initial structure of the data projection. For example, let’s say that we want to visualize the categorical variable
clarity when comparing depth and price:
# Visualize depth as a function of price; use color to visualize clarity label
p4 <- ggplot(diamonds, aes(price, depth, color = clarity))

# Use points (scatter plot)
p4 + geom_point()
For visualizations that use solid objects, you can also specify a fill aesthetic:
p5 <- ggplot(diamonds, aes(price, fill = clarity))

# Setting the alpha channel to 50% makes the density plots transparent
p5 + geom_density(alpha = 0.5)
There are many more options for visualization schemes (geom_ functions), for mapping annotation variables onto additional visualization parameters (ex: alpha), and for manipulating axes, labels, and legends.
See the excellent ggplot2 documentation for the full list of functions with useful examples: docs.ggplot2.org.
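As one small illustration of adjusting labels and legends, the sketch below adds a title, axis labels, and a legend title with labs, and moves the legend with theme. The label text, the transparency value, and the name labelled are choices made for this example.

```r
library(ggplot2)

p4 <- ggplot(diamonds, aes(price, depth, color = clarity))

labelled <- p4 +
  geom_point(alpha = 0.3) +   # fixed transparency applied to every point
  labs(
    title = "Diamond depth versus price",
    x     = "Price (US dollars)",
    y     = "Depth (%)",
    color = "Clarity"
  ) +
  theme(legend.position = "bottom")

print(labelled)
```

Note the difference between setting alpha as a fixed value inside a geom, as here, and mapping a variable to an aesthetic inside aes.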
Now, make some pretty graphs!