Setup

# Enter your code here!

Step 3: Annotate a complete chromosome

Overview

# Finds all open reading frames in the chromosome sequence in a FASTA file
#
# fastaFile   A string containing the name of a chromosome file in FASTA format
#
# value       A data.frame.  Coordinates are relative to DNA sequence in input.
#
annotateChromosome <- function(fastaFile) {


}
# Enter your code here!

Here are the templates:

# Parses a sequence file in FASTA format
#
# fastaFile   A string giving the name of a fasta file to parse
#
# value       A string containing the parsed sequence
#
loadFASTA <- function(fastaFile) {


}
# Enter your code here!

# Annotates the ORFs found in a given reading frame of a dnaStrand
#
# dnaStrand   A string containing DNA sequence
# offset      The frame offset (0, 1, or 2)
#
# value       A data.frame with a `length` and `startPosition` column. 
#             Coordinates are relative to dnaStrand.
#
annotateFrame <- function(dnaStrand, offset) {


}
# Enter your code here!

# Calculates the reverse complement of a dnaStrand
#
# dnaStrand   A string containing the forward DNA sequence
#
# value       A string containing the reverse complement to dnaStrand
#
reverseComplement <- function(dnaStrand) {


}
# Enter your code here!

# Reverses the characters in a string
reverseString <- function(a) {
  
}
# Enter your code here!

reverseString and reverseComplement

For example:

strsplit(   "abcd"         , split = "")
## [[1]]
## [1] "a" "b" "c" "d"
strsplit( c("abcd", "defg"), split = "")
## [[1]]
## [1] "a" "b" "c" "d"
## 
## [[2]]
## [1] "d" "e" "f" "g"

Remember, you can select the element at a given position in a list using double brackets, [[ ]]. For example:

strsplit("abcd", split = "")[[1]]
## [1] "a" "b" "c" "d"
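
Building on strsplit, here is a minimal sketch of one possible way to reverse a single string: split it into characters, reverse the vector with rev, and rejoin it with paste. The name reverseSketch is hypothetical and is not part of the templates.

```r
# Hypothetical helper: one way to reverse a single string
reverseSketch <- function(a) {
  chars <- strsplit(a, split = "")[[1]]  # split into single characters
  paste(rev(chars), collapse = "")       # reverse the vector and rejoin
}

reverseSketch("abcd")
## [1] "dcba"
```

The [[1]] matters here: without it, rev would reverse the one-element list rather than the characters inside it.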

However, consider what happens when each base is replaced in turn with gsub:

# Replacing one base at a time overwrites the earlier substitutions
dna <- "ATGCATCG"
dna <- gsub("A", "T", dna)
dna
## [1] "TTGCTTCG"
dna <- gsub("G", "C", dna)
dna
## [1] "TTCCTTCC"
dna <- gsub("T", "A", dna)
dna
## [1] "AACCAACC"
dna <- gsub("C", "G", dna)
dna
## [1] "AAGGAAGG"

# Writing the replacement in lower case keeps it distinct from untouched bases
dna <- "ATGCATCG"
dna <- gsub("A", "t", dna)
dna
## [1] "tTGCtTCG"
dna <- toupper(dna)
dna
## [1] "TTGCTTCG"
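
As an alternative to chaining gsub calls, base R's chartr translates every character in a single pass, so a base that has already been swapped is never swapped again. This is only a sketch of one option; the lower-case approach shown above works as well.

```r
dna <- "ATGCATCG"

# Each character in "ACGT" maps to the character at the same position in "TGCA"
complement <- chartr("ACGT", "TGCA", dna)
complement
## [1] "TACGTAGC"
```

Note that this gives the complement only; the result still has to be reversed to obtain the reverse complement.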

annotateFrame
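
A reading frame is determined by where the first codon starts, which is what the offset argument (0, 1, or 2) in the template controls. The sketch below shows one possible way to cut a strand into the codons of a given frame using substring. The helper name codonsInFrame is hypothetical, and it assumes the strand holds at least one complete codon in the requested frame.

```r
# Hypothetical helper: split a strand into the codons of one reading frame.
# Assumes nchar(dnaStrand) >= offset + 3, otherwise seq() raises an error.
codonsInFrame <- function(dnaStrand, offset) {
  # 1-based start position of every complete codon in this frame
  starts <- seq(from = offset + 1, to = nchar(dnaStrand) - 2, by = 3)
  substring(dnaStrand, starts, starts + 2)
}

codonsInFrame("ATGAAATAGC", offset = 0)
## [1] "ATG" "AAA" "TAG"
codonsInFrame("ATGAAATAGC", offset = 1)
## [1] "TGA" "AAT" "AGC"
```

Because the start positions are kept in a vector, the position of any codon of interest can be read straight from starts, which is convenient when coordinates must be reported relative to dnaStrand.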

loadFASTA

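# Read the file as a character vector with one element per line
dnaLines <- readLines("data/chr01-truncated.fsa")
length(dnaLines)
## [1] 499
# Drop the FASTA header on line 1, keeping only the sequence lines
dnaLines <- dnaLines[2:length(dnaLines)]
# Join the sequence lines into a single string
dna <- paste(dnaLines, collapse = "", sep = "")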
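
The same steps can be wrapped in a function. The following is a minimal sketch, not a definitive implementation: the name loadFastaSketch is hypothetical, and it assumes the file holds a single record whose first line is the ">" header. It is tried on a small temporary file with made-up content rather than on real data.

```r
# Hypothetical helper; assumes a single-record FASTA file with one header line
loadFastaSketch <- function(fastaFile) {
  dnaLines <- readLines(fastaFile)
  dnaLines <- dnaLines[-1]          # drop the header line
  paste(dnaLines, collapse = "")    # join the sequence lines
}

# Write a tiny made-up FASTA file and parse it
tmp <- tempfile(fileext = ".fsa")
writeLines(c(">demo", "ATGC", "GGTA"), tmp)
demoSeq <- loadFastaSketch(tmp)
demoSeq
## [1] "ATGCGGTA"
```

Negative indexing (dnaLines[-1]) is used instead of 2:length(dnaLines) because the latter counts backwards to c(2, 1) when the file has only one line.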

Combining data.frame objects
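
Data frames with the same column names can be stacked row-wise with rbind. The sketch below uses made-up ORF tables with the length and startPosition columns from the annotateFrame template; the extra frame column is an assumption added here for illustration.

```r
# Made-up ORF tables for two frames (values are illustrative only)
frame0 <- data.frame(length = c(300, 120), startPosition = c(1, 901))
frame1 <- data.frame(length = 210, startPosition = 44)

# Assumption: record which frame each row came from before stacking
frame0$frame <- 0
frame1$frame <- 1

allOrfs <- rbind(frame0, frame1)
nrow(allOrfs)
## [1] 3
```

When the tables are held in a list, do.call(rbind, listOfFrames) stacks all of them in one call.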

Step 4: Annotate the whole genome!

For example:

wget http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr02.fsa
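
The download can also be scripted from R with download.file. The sketch below builds the URLs for the 16 nuclear chromosomes, assuming the other files follow the same chrNN.fsa naming as the chr02.fsa example above. The helper downloadChromosomes is hypothetical and is only defined here, not called, because it needs network access.

```r
baseUrl <- "http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta"

# Assumption: files are named chr01.fsa ... chr16.fsa, like the chr02.fsa example
chromosomeFiles <- sprintf("chr%02d.fsa", 1:16)
chromosomeUrls <- paste(baseUrl, chromosomeFiles, sep = "/")
chromosomeUrls[2]
## [1] "http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr02.fsa"

# Hypothetical helper: fetch each URL into destDir
downloadChromosomes <- function(urls, destDir = "data") {
  dir.create(destDir, showWarnings = FALSE)
  for (u in urls) {
    download.file(u, destfile = file.path(destDir, basename(u)))
  }
}
```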

Challenge #1: Transcription factor binding site

TATAAA
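
One possible way to locate this motif in a sequence is gregexpr with fixed = TRUE, which returns the start position of every match. The helper name motifPositions and the example sequence are made up for illustration, and the sketch searches only the strand it is given.

```r
# Hypothetical helper: 1-based start positions of a fixed motif in one strand
motifPositions <- function(dna, motif = "TATAAA") {
  hits <- gregexpr(motif, dna, fixed = TRUE)[[1]]
  # gregexpr reports -1 when there is no match at all
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

motifPositions("GGTATAAACCTATAAAGG")
## [1]  3 11
```

Comparing these positions with the start positions of the annotated ORFs is one way to decide which genes have the motif upstream.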

What percentage of yeast genes are likely transcribed from TBP-bound promoters?

Challenge #2: Handle introns

Start sites:

GUAUGU 
GUAAGU 
GUAUGC 
GUAUGA 
GUACGU 
GUCAGU 
GUUAAG 
GUAGUA 
GCAUGU 
GUUCGU 
GUGAGU 
GCAAGU

Branch points:

GACUAAC 
UACUAAC 
AACUAAC 
AAUUAAC 
CACUAAC 
UGCUAAC 
UAUUAAC 
AGUUAAC 
CGUUAAC 
UGUUAAC 
CAUUAAC

End sites:

CAG
AAG
UAG

grepl("bio297", "Hello bio297!")
## [1] TRUE
grepl("bio297", "Hello bio397!")
## [1] FALSE
# Match course number 297 or 397
grepl("bio[23]97", "Hello bio297!")
## [1] TRUE
grepl("bio[23]97", "Hello bio397!")
## [1] TRUE

See what happened there? To match all splice end sites we could use the regular expression:

"[CAU]AG"

How did your ORF list change when you accounted for introns?
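
As a starting point, the end-site pattern above can be applied with gregexpr to list every candidate position. The splice motifs are written with RNA letters, so this sketch first rewrites the DNA with U in place of T; the example sequence is made up.

```r
# The motifs use RNA letters, so write the DNA with U instead of T
dna <- "GGCAGTTTAGAAAGCTAG"
rna <- chartr("T", "U", dna)

# Start position of every match to the end-site pattern
endSites <- as.integer(gregexpr("[CAU]AG", rna)[[1]])
endSites
## [1]  3  8 12 16
```

The longer start-site and branch-point lists could be handled in a similar way by joining the alternatives into one pattern, for example with paste(sites, collapse = "|").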