# Setup

We’re going to put the pieces of the last two exercises together with some new higher level functions to actually start annotating a chromosome. Include your refined implementations for the transcribe, translate and findORF functions here:

# Enter your code here!


# Step 3: Annotate a complete chromosome

## Overview

Now we’re ready to think about the structure of our function which will annotate all of the possible ORFs in the sequence for a chromosome. Here’s the template we’ll use for our big annotation function:

# Finds all open reading frames in the chromosome sequence in a FASTA file
#
# fastaFile   A string containing the name of a chromosome file in FASTA format
#
# value       A data.frame.  Coordinates are relative to DNA sequence in input.
#
annotateChromosome <- function(fastaFile) {

}
# Enter your code here!


We code write all of the code we need to accomplish the goals of annotateChromosome in one ginormous function body. But it’s always a good idea to avoid doing this: as a rule of thumb if a function has grown beyond a dozen or so lines, you might want to break it up into sub-parts to make it easier to read and alter in the future. So we’ll also write a few helper functions to keep annotateChromosome from becoming bloated.

Here are the templates:

# Parses a sequence file in FASTA format
#
# fastaFile   A string giving the name of a fasta file to parse
#
# value       A string containing the parsed sequence
#

}
# Enter your code here!

# Annotates the ORFs found in a given reading frame of a dnaString
#
# dnaStrand   A string containing DNA sequence
# offset      The frame offset (0, 1, or 2)
#
# value       A data.frame with a length and startPosition column.
#             Coordinates are relative to dnaStrand.
#
annotateFrame <- function(dnaStrand, offset) {

}
# Enter your code here!

# Calculates the reverse complement of a dnaStrand
#
# dnaStrand   A string containing the forward DNA sequence
#
# value       A string containing the reverse complement to dnaStrand
#
reverseComplement <- function(dnaStrand) {

}
# Enter your code here!

# Reverses the characters in a string
reverseString <- function(a) {

}
# Enter your code here!


I’m not going to walk you through exactly how to write the body of each of these functions. Instead, I’ll give some pointers for each in the sections that follow.

## reverseString and reverseComplement

To implement this function you’ll probably want to start by filling in the reverseString utility function. For that job you might want to use the strsplit function (using split = "" will split a string into it’s characters). Note that strsplit returns a list with one element for each string in the character vector you give it.

For example:

strsplit(   "abcd"         , split = "")
## [[1]]
## [1] "a" "b" "c" "d"
strsplit( c("abcd", "defg"), split = "")
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "d" "e" "f" "g"

Remember, you can always select the element at a position in a list using double [[]]. For example:

strsplit("abcd", split = "")[[1]]
## [1] "a" "b" "c" "d"

For reverseComplement you’ll hopefully also be thinking that you’ll want to use gsub to switch A -> T, G -> C and visa versa.

However, consider the following:

dna <- "ATGCATCG"
dna <- gsub("A","T", dna)
dna
## [1] "TTGCTTCG"
dna <- gsub("G","C", dna)
dna
## [1] "TTCCTTCC"
dna <- gsub("T","A", dna)
dna
## [1] "AACCAACC"
dna <- gsub("C","G", dna)
dna
## [1] "AAGGAAGG"

What happened here? To get around this problem I like to use a little trick with cases: switch the case of the substitution letter then use the toupper function to convert lower case letters back to upper case. For example:

dna <- "ATGCATCG"
dna <- gsub("A","t", dna)
dna
## [1] "tTGCtTCG"
dna <- toupper(dna)
dna
## [1] "TTGCTTCG"

As long as you’ve sanitized your inputs to this function and can rely on a starting case, it’s a simple way to avoid overlapping substitutions.

## annotateFrame

This function is setup to accept a frame offset argument. You’ll call annotateFrame from annotate chromosome six times, once for each strand, and three times with offset = 0, offset = 1, and offset = 2. If you’re having trouble visualizing reading frames in sequence and traking starting positions I’d recommend drawing out a diagram for yourself.

If you call findORFs from annotateFrame remember that you’ll want to convert lengths and startPositions from the amino acid coordinates to the correct DNA coordinates!

Take a look at the format of FASTA files. You can find a truncated version of yeast chromosome 1 FASTA (DNA sequence) file here: data/chr01-truncated.fsa"">data/chr01-truncated.fsa. Once we know our code is working we can run it on the full sequence, but for now it makes sense to use a smaller test dataset for troubleshooting.

Looking at the contents of this file, you’ll probably want to omit the first line of this file because it doesn’t contain sequence. We can do that with indexing:

dnaLines <- readLines("data/chr01-truncated.fsa")
length(dnaLines)
## [1] 499
dnaLines <- dnaLines[2:length(dnaLines)]

Also we’ll want to collapse all of the strings in this vector into a single string. We can use paste to do that:

dna <- paste(dnaLines, collapse = "", sep = "")

## Combining data.frame objects

In several places you’ll might want to be able to combine data.frame tables, adding the rows of one table to another. See the merge, rbind and cbind functions for a convenient way to do this. Alternatively it can be done with the tidyr package; see the cheatsheet for reference.

# Step 4: Annotate the whole genome!

Once you have your annotateChromosome function working with a sample chromosome FASTA file (takes the name of a chromosome FASTA file and produces a data.frame with ORF information), you’re ready to write a function that annotates the entire yeast genome.

You can download the latest yeast genome sequence here. Each of these files contains the DNA sequence for one of the yeast chromosomes.

This is probably a good time to introduce a Linux command line tool that is really useful when you need to pull files off the internet as part of a data analysis project. In R studio, you can open a Linux command line shell, by opening Tools -> Shell. The wget program takes will download a file from a URL to the local directory.

For example:

wget http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr02.fsa

# Challenge #1: Transcription factor binding site

The TATA-Binding Protein (TBP) is an important eukaryotic transcription factor. It promotes transcription at many, but not all, eukaryotic genes. It binds to the “TATA-box” sequence upstream of genes:

TATAAA

Modify your annotation script so that it produces a table listing TBP binding sites in the yeast genome. Include a column that lists the likely ORF that is the target of TBP transcription target. Remember that you’ll have to keep track of both DNA strands.

What percentage of yeast genes are likely transcribed by TBP bound promoters?

# Challenge #2: Handle introns

Yeast is a eukaryote, which means the protein coding sequences (ORFs) can be interrupted by introns. Introns are defined by 3 sequences: start sequence, end sequence and a special sequence in the middle of the intron called a branch point sequence. Unlike the TATA-Box however, each of these sequences comes in several different flavors. Specifically, the variants are:

Start sites:

GUAUGU
GUAAGU
GUAUGC
GUAUGA
GUACGU
GUCAGU
GUUAAG
GUAGUA
GCAUGU
GUUCGU
GUGAGU
GCAAGU

Branch points:

GACUAAC
UACUAAC
AACUAAC
AAUUAAC
CACUAAC
UGCUAAC
UAUUAAC
AGUUAAC
CGUUAAC
UGUUAAC
CAUUAAC

End sites:

CAG
AAG
UAG

We can use regular expressions to find patterns that have variable characters at different positions. Consider the following example:

grepl("bio297", "Hello bio297!")
## [1] TRUE
grepl("bio297", "Hello bio397!")
## [1] FALSE
#Match course number 297 or 397
grepl("bio[23]97", "Hello bio297!")
## [1] TRUE
grepl("bio[23]97", "Hello bio297!")
## [1] TRUE

See what happened there? To match all splice end sites we could use the regular expression:

"[CAU]AG"

Re-factor your annotateChromosome function so that it first finds and then removes (hint: gsub) all of the likely introns from the chromosomal sequence before attempting to annotate the ORFs. You might

How did your ORF list change when you accounted for introns?