# Lab: Explorying Genomes

In this week’s labs we are going to perform a de novo annotation of the a yeast chromosome. We’ll be using R to find Open Reading Frames (ORFs) on yeast chromosomes, search for introns, and even annotate the locations of common transcription factor binding sites. Here’s an overview of each of the stages of today’s lab:

• Step 1: refine transcribe and translate. We’ll start by refining your earlier versions of the transribe and translate functions. We’re going to use these to help us write an ORF finder.

• Step 2: write an findORF function. Next, we’ll write a function which takes a string of amino acids and returns all of the possible true ORFs in that string. This function will accept a cutoff argument which filters out ORFs which are likely too small to be real.

• Step 3: write an annotateChromosome function which takes the name of a file containing genomic DNA sequence for a yeast chromosome and returns a data.frame where each row is a description of a different putative ORF on that chromosome.

• Step 4: glue everything together with an annotateGenome function that loads DNA sequence for each yeast chromosome and annotates it with annotateChromosome.

# Preface on Best Practices

## Keeping data and code organized: RStudio projects

One of the most important things that you can do to keep from making major mistakes or wasting a lot of time when working on a data analysis project is to give a little thought to how you are going to organize the code and data for individual projects.

RStudio has a great tool to help you do this: Projects. You can create a new project from File -> New Project, or from the drop down in the upper right hand corner. If you select “New project”, RStudio will create a blank subdirectory of your home folder with the name you provided. When a project is active RStudio sets the current working directory (default file path) to this folder. It also scopes all of your command history and environment variables to this project (by saving these data in this subfolder).

It’s a good idea to make one RStudio project for independent data analysis task you’re working on. Make a new project for today’s lab (you might call it “genome-annotation-lab”.

## Using functions to organize code

R features a very clear and relatively unambiguous syntax. This means that you, as an author of R code, have a lot of leeway to format your code in any way you see fit. Please use this feature of the language for good, not evil!

Your code should function as a document meant for other people to read; it is a narrative that tells a story about computation. That machines can parse and execute your code is a happy side effect of a well-designed language.

One of the most important language features of modern languages designed to help you organize your code is the function. You can declare your own functions in R, using the function syntax:

To create a function:

addTwo <- function(a, b) {
a + b
}

Here the function keyword created a function, which we then saved in the addTwo variable. Functions are “first class” in R meaning that they can be assigned to variables and passed around like any other kind of data.

The variables that you pass into a function are called its arguments; the result of calling a function is called its value. In R all functions have values. By default the value of a function is the result of the last computation it performed; there’s no explicit return statement in R. However, if you’d like to explicitly stop running a function and return a value before the end of the function code you can use the return() function.

Arguments to functions can have default values. For example:

addThree <- function(a, b = 1, c = 1) {
a + b + c
}
addThree(1)
## [1] 3
addThree(1, 2)
## [1] 4
addThree(1, c = 4)
## [1] 6

Also note, since function can be bound to variables, they can also be passed to other functions as arguments.

## Matters of Style (and style matters!)

You might notice that there are a couple discrepancies between these two guides (and my own coding style). I won’t strictly enforce one style, but I do ask that you follow two rules:

1. ALWAYS be consistent. For example, you may pick from any of the three variable naming conventions (my.variable, my_variable, or myVariable), but ALWAYS use the same convention throughout all of your code. I use the latter style for standard variable names (it’s called camel case).

2. Keep each line of code to 80 characters or less, using newlines to break things up. You can turn on an 80 character guide in RStudio from the Tools -> Global Options -> Code Editing pane.

Finally, I’ll admit a style quirk that I picked up when submitting code to a non-R code project a few years ago: the common-first style.

In R, arguments to functions are separated by commas, so we might create a named vector like:

myVar <- c(a = 10, b = 100, c = 12, d = 14, exz = 142, f = 1293, g = 0, h = NA)

If your list of arguments is long, or have complex expressions for their values, it is often nice to split each element onto it’s own line, like (notice the extra spacing to make all of the “=” signs line up nicely (as noted in the Google style guide):

myVar <- c(a    = 10,
b    = 100,
c    = 12,
d    = 14,
exz  = 142,
f    = 1293,
g    = 0,
h    = NA)

In the comma-first style, you … put the comma first:

myVar <- c( a    = 10
, b    = 100
, c    = 12
, d    = 14
, exz  = 142
, f    = 1293
, g    = 0
, h    = NA
)

Notice that the open parens ( commas , and close parens ) are all vertically aligned. Why would you do this? Well, try to find the syntax error in the following two code blocks:

myVar <- c(a    = 10,
b    = 100,
c    = 12,
d    = 14
exz  = 142,
f    = 1293,
g    = 0,
h    = NA)

myVar <- c( a    = 10
, b    = 100
, c    = 12
, d    = 14
, exz  = 142
f    = 1293
, g    = 0
, h    = NA
)

Which did you see first? The comma-first style is even more useful when working with nested data structures.

However, use of this style is still controversial, even in the community where it originated (javascript). But I found it solved two problems I frequently had: missing commas when rapidly iterating over different argument sets for function calls and missing close parens in nested call structures, so I adopted it when writing in languages that can support this syntax (it won’t work in Python because of the privileged nature of white-space). You’re welcome to use either style as long as you are consistent!

# Step 1: transcribe and translate

Let’s start off by creating a fresh script file. Use “File” > “New” > “R Script” in the Rstudio menus to open a blank script file. Save this file with the name genomeAnnotation.r.

We’re going to create a transcribe function with an interface that will meet our needs today. When we talk about a function’s interface, we mean (a) what the function expects you to pass into it as arguments, and (b) what the result of the function is, or it’s value.

Today we’re going to practice the good habit of using comment lines to keep a clear record of the interface for each of our functions, so that we don’t get confused.

Here’s a template for today’s version of your transcribe function should look like:

# Transcibes DNA sequence into RNA equivalent
#
# dna   A character string containing DNA sequence
#
# value A character string containing RNA sequence
#
transcribe <- function(dna) {

}

See how we’ve used the comment lines to clearly describe:

• What our function does (fist line)
• What kinds of data we expect for the arguments (dna)
• What kinds of data the function produces (value)

We’ll write descriptions like this for all of the functions we include in our genomeAnnotation.r script.

Once you’ve finished implementing a transcribe function, load your script into the interactive session (console) so that you can test it. You can load or source your script easily in couple of ways:

• Click on the source button on the right hand side of the script editor window
• Hit ctrl + shift + s
• Check the “Source on save” check box and then save you changes to the script by click on the save button (disk icon) or hitting ctrl + s

Now let’s test your transcribe function with some sample input:

transcribe("ATGCTTATCTA")

If you got a different result, go back and try to fix your function.

Add a translate function to genomeAnnotation.r that follows the following template:

# Translates RNA sequence into amino acid sequence
#
# rna   A character string containing RNA sequence
#
# value A character string containing amino sequence
#
translate <- function(rna) {

}

Here are some building blocks that you might find useful in building your translation function:

• substr
• paste
• sapply

This last one (sapply) will be a bit tricky to warp your head around at first. I’d recommend playing with it at the command line to get a feel for how function application works!

Note, we’ve explicitly stated that our function should return a single character string containing the amino acid sequence. Consider the difference between:

# Here are two example character vectors

# A vector of many elements; each element is only one character
c("V","R","S","P","L")
## [1] "V" "R" "S" "P" "L"
# A vector of one element; the element has many characters
"VRSPL"
## [1] "VRSPL"

What we want our translate function to produce is like the 2nd example, what it produces is like the first. So how can you take a character vector of many elements and turn it into a single-element vector of many characters?

This is a time when is the paste function is quite useful:

# Some example uses of the paste function

paste(c("A","B","C","D"), sep = "")
## [1] "A" "B" "C" "D"
paste(c("A","B","C","D"), sep = ", ")
## [1] "A" "B" "C" "D"

Once you’ve implemented a translate function, test it with some different with some different inputs:

translate("AUG")
translate("UUCUAAAUUAACAAAAUC")
translate("UUCUAAAUUAACAAAAU")
translate("UUCUAAAUUAACAAAA")

Did those last two lines fail for your function or did it work? If you got an error or an “NA” at the end of your result vector, let’s make our translate function a little bit smarter, so we don’t have to worry about it choking on input strings that aren’t evenly divisible by 3. The easiest thing to do is to just have it ignore incomplete final codons.

(Hint: check out the nchar function along with the modulo operator %%/)

When your function is working on all of the inputs above, you’re ready to move on to step 2!

# Step 2: Writing an ORF finder

## Function template

We now have the tools in place that we’ll need to write a function that finds ORFs. This function will take a string containing a sequence of amino acids and find all of the possible ORFs in that string. The logic of this function will be to identify stretches of amino acids that start with a methionine (“M”) and end with one of our stop codons (“X”). This function should also accept a ‘cutoff’ argument which will be used to filter ORFs that are too short to be believable.

Let’s start off by adding a template for the findORF function to our genomeAnnotation.r script.

# Returns a data.frame describing the ORFs found in the given aminoacid sequence
#
# aminoacids  A character string containing aminoacid sequence
# cutoff      An integer giving the minimum aminoacid length of valid ORFs
#
# value       A data.frame, coordinates refer to amino acid position
#
findORF <- function(aminoacids, cutoff = 100) {

}

So what’s the first thing this function is going to need to do? We want it to find stretches of amino acids that look like they correspond to real open reading frames. What’s the best way to look for small patterns in larger strings? The answer is regular expressions.

## Regular expressions

Regular expressions are a powerful tool available in many programming languages which are designed to find patterns in text. Regular expressions are a mini-programming language all of their own which allows us to describe many different types of complex patterns.

In this case the pattern we need to find is:

M...X

Where “M” is a methionine, “X” is one of our stop codons, and “…” represents a variable number of amino acids between M and X. Furthermore, we want to find ALL of the non-overlapping instances of the “M…X” pattern in our input amino acid sequence.

You can read about the full power of the regular expression syntax in the R help, but for today I’ll guide you through how to write them.

Our regular expression pattern is going to be the string:

"M(.*?)X"

Here’s what’s going on in the pattern:

• The “M” at the beginning says ‘look for patterns that start with M’
• The “X” at the end says ‘…and end with X’
• The “(…)” adds a subpattern that we’ll search for between the M and X
• “.” says ‘find any character that isn’t a space’
• The “*" says ‘find between 0 and an infinite number of those non-space characters’
• The “?” says ‘but make the stuff inside of the “(…)” as small as possible, while still conforming to my other rules’. If we didn’t add the “?” this pattern would match starting at the first M in our sequence – then gobble up all other M’s and X’s – until it hit the last X in the sequence. Obviously, not what we want!

All of the functions listed on the help page for ?sub can take regular expressions as their search patterns. Today we’ll use gregexpr (global regular expression) function.

When trying to learn how a new function works, it’s often useful to make up some example data that will make it easy to discover what a function is doing. Let’s make a sample “amino acid” sequence that will be easy to keep track of.

sampleAminoAcids <- "aaaaMbbbbXccccMddddXeeee"

See what I did there? If everything is working correctly we should find “MbbbbX” and “MddddX” and nothing else.

Let’s give gregexpr a whirl with our sample data and our regular expression:

gregexpr( pattern = "M(.*?)X", text = sampleAminoAcids )
## [[1]]
## [1]  5 15
## attr(,"match.length")
## [1] 6 6
## attr(,"useBytes")
## [1] TRUE

There’s a lot going with the output from this function – we’re starting to get a peak at some advanced R features. But notice that line that has the vector 5, 15? That suspiciously similar to the positions of our “M”’s in the sample string. See the line that has the vector 6, 6? Rather interesting, since both of our sample ORF sequences should be 6 amino acids long! As a novice user, you should be thinking to yourself “I think this is going to work for us”!

The structure being returned by gregexpr is called a list. It’s kind of like a basket that can hold other smaller types of data, like vectors. We’re not going to get distracted with the details today; instead let’s jump to how we can extract our ORF start positions and lengths from this structure.

Try the following:

matches         <- gregexpr(pattern = "M(.*?)X", text = sampleAminoAcids)
startPositions  <- as.vector(matches[[1]])
lengths         <- attr(matches[[1]], "match.length")

startPositions
## [1]  5 15
lengths
## [1] 6 6

If you’re interested in what the “[[1]]” is about: had our text argument been a vector with more than one string matches would have had additional elements accessible with [[2]]…[[n]]. If this detail stresses you out; ignore it!

Let’s also use our knowledge of substring (note the differences between substr and substring) to get the actual amino acid sequence:

orfSequence <- substring( sampleAminoAcids
, first = startPositions
, last  = startPositions + lengths -1
)

What does orfSequence hold now? Why do we need the “-1” in the last argument?

So now we know how to write code that will match all of the possible ORFs in a sequence of amino acids! Now let’s think about how we want to collect our ORF information in a user friendly way.

## Creating a data.frame

We’ve used data.frame’s before by loading tabular data in from files, but you can also create them de novo from with in an R session. The syntax is:

data.frame( variable1 = c(1,2,3,4)
, variable2 = c("a","b","c","d")
)
##   variable1 variable2
## 1         1         a
## 2         2         b
## 3         3         c
## 4         4         d

See what happened there? Each argument will become a column heading in your data.frame, and its value will become the data in that column. You’ll want to make sure your value vectors are all the same lengths or you’ll get an error.

So how can we store our ORF information in a convenient tabular format? With something like this:

orfTable <- data.frame( startPosition = startPositions
, length        = lengths
, aminoacids    = orfSequence
)

Take a look at the contents of orfTable. It’s time to take all of the parts we’ve just looked at and assemble them back into a whole.

Once you’ve created a data.frame in your code you can easily write it out to a file with the write.table function.

Return to the template for your findOrf function above and fill in the function body. The thing we promised that this function would do that we haven’t handled yet is filtering out ORFs that are below the size given by the cutoff argument. Use your knowledge of selectors ([]) to filter the final version version of your orfTable based on the values in the length column.

# Step 3: Annotate a complete chromosome

## Overview

Now we’re ready to think about the structure of our function which will annotate all of the possible ORFs in the sequence for a chromosome. Here’s the template we’ll use for our big annotation function:

# Finds all open reading frames in the chromosome sequence in a FASTA file
#
# fastaFile   A string containing the name of a chromosome file in FASTA format
#
# value       A data.frame.  Coordinates are relative to DNA sequence in input.
#
annotateChromosome <- function(fastaFile) {

}

We’ll also write a few helper functions to keep annotateChromosome from becoming too bloated.

Here are the templates:

# Parses a sequence file in FASTA format
#
# fastaFile   A string giving the name of a fasta file to parse
#
# value       A string containing the parsed sequence
#

}

# Annotates the ORFs found in a given reading frame of a dnaString
#
# dnaStrand   A string containing DNA sequence
# offset      The frame offset (0, 1, or 2)
#
# value       A data.frame with a length and startPosition column.
#             Coordinates are relative to dnaStrand.
#
annotateFrame <- function(dnaStrand, offset) {

}

# Calculates the reverse complement of a dnaStrand
#
# dnaStrand   A string containing the forward DNA sequence
#
# value       A string containing the reverse complement to dnaStrand
#
reverseComplement <- function(dnaStrand) {

}

# Reverses the characters in a string
reverseString <- function(a) {
paste(rev(substring(a,1:nchar(a),1:nchar(a))),collapse="")
}

I snuck in a function implementation in the last section; this is a common idiom used in R to reverse the sequence of a string. Hopefully you can prove to yourself that you understand why it works!

I’m not going to walk you through exactly how to write the body of each of these functions. Instead, I’ll give some pointers for each in the sections that follow.

## reverseComplement

To implement this function you’ll probably want to use the reverseString utility function. You’ll hopefully also be thinking that you’ll want to use gsub to switch A -> T, G -> C and visa versa.

However, consider the following:

dna <- "ATGCATCG"
dna <- gsub("A","T", dna)
dna
## [1] "TTGCTTCG"
dna <- gsub("G","C", dna)
dna
## [1] "TTCCTTCC"
dna <- gsub("T","A", dna)
dna
## [1] "AACCAACC"
dna <- gsub("C","G", dna)
dna
## [1] "AAGGAAGG"

What happened here? To get around this problem you can use a little trick: switch the case of the substitution letter then use the toupper function to convert lower case letters back to upper case. For example:

dna <- "ATGCATCG"
dna <- gsub("A","t", dna)
dna
## [1] "tTGCtTCG"
dna <- toupper(dna)
dna
## [1] "TTGCTTCG"

## annotateFrame

This function is setup to accept a frame offset argument. You’ll call annotateFrame from annotate chromosome six times, once for each strand, and three times with offset = 0, offset = 1, and offset = 2.

If you call findORFs from annotateFrame remember that you’ll want to convert lengths and startPositions from the amino acid coordinates to the correct DNA coordinates.

Take a look at the format of FASTA files. You can find a complete yeast chromosome 1 FASTA file here: data/chr01.fsa">data/chr01.fsa.

Looking at the contents of this file, you’ll probably want to omit the first line of this file because it doesn’t contain sequence. We can do that with indexing:

dnaLines <- readLines("data/chr01.fsa")
length(dnaLines)
## [1] 499
dnaLines <- dnaLines[2:length(dnaLines)]

Also we’ll want to collapse all of the strings in this vector into a single string. We can use paste to do that:

dna <- paste(dnaLines, collapse = "", sep = "")

## Combining data.frame objects

In several places you’ll might want to be able to combine data.frame tables, adding the rows of one table to another. See the merge function for a convenient way to do this!

# Step 4: Annotate the whole genome!

Once you have your annotateChromosome function working with a sample chromosome FASTA file (takes the name of a chromosome FASTA file and produces a data.frame with ORF information), you’re ready to write a function that annotates the entire yeast genome.

You can download the latest yeast genome sequence here. Each of these files contains the DNA sequence for one of the yeast chromosomes.

This is probably a good time to introduce a Linux command line tool that is really useful when you need to pull files off the internet as part of a data analysis project. In R studio, you can open a Linux command line shell, by opening Tools -> Shell. The wget program takes will download a file from a URL to the local directory.

For example:

wget http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr02.fsa

# Challenge #1: Transcription factor binding site

The TATA-Binding Protein (TBP) is an important eukaryotic transcription factor. It promotes transcription at many, but not all, eukaryotic genes. It binds to the “TATA-box” sequence upstream of genes:

TATAAA

Modify your annotation script so that it produces a table listing TBP binding sites in the yeast genome. Include a column that lists the likely ORF that is the target of TBP transcription target. Remember that you’ll have to keep track of both DNA strands.

What percentage of yeast genes are likely transcribed by TBP bound promoters?

# Challenge #2: Handle introns

Yeast is a eukaryote, which means the protein coding sequences (ORFs) can be interrupted by introns. Introns are defined by 3 sequences: start sequence, end sequence and a special sequence in the middle of the intron called a branch point sequence. Unlike the TATA-Box however, each of these sequences comes in several different flavors. Specifically, the variants are:

Start sites:

GUAUGU
GUAAGU
GUAUGC
GUAUGA
GUACGU
GUCAGU
GUUAAG
GUAGUA
GCAUGU
GUUCGU
GUGAGU
GCAAGU

Branch points:

GACUAAC
UACUAAC
AACUAAC
AAUUAAC
CACUAAC
UGCUAAC
UAUUAAC
AGUUAAC
CGUUAAC
UGUUAAC
CAUUAAC

End sites:

CAG
AAG
UAG

We can use regular expressions to find patterns that have variable characters at different positions. Consider the following example:

grepl("bio285", "Hello bio285!")
## [1] TRUE
grepl("bio285", "Hello bio385!")
## [1] FALSE
#Match course number 285 or 385
grepl("bio[23]85", "Hello bio285!")
## [1] TRUE
grepl("bio[23]85", "Hello bio285!")
## [1] TRUE

See what happened there? To match all splice end sites we could use the regular expression:

"[CAU]AG"

Re-factor your annotateChromosome function so that it first finds and then removes (hint: gsub) all of the likely introns from the chromosomal sequence before attempting to annotate the ORFs. You might

How did your ORF list change when you accounted for introns?