Exploring Genomes

Preface on Best Practices

Using functions to organize code

Your code should function as a document meant for other people to read; it is a narrative that tells a story about computation. That machines can parse and execute your code is a happy side effect of a well-designed language.

To create a function:

addTwo <- function(a, b) {
  a + b
}
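
Calling the function is then a matter of passing a value for each argument. A quick sketch (the input values are arbitrary):

```r
# Redefine addTwo from above so this snippet runs on its own
addTwo <- function(a, b) {
  a + b
}

# The value of the last expression in the body is what the function returns
result <- addTwo(2, 3)
result
## [1] 5
```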

Arguments to functions can have default values. For example:

addThree <- function(a, b = 1, c = 1) {
  a + b + c
}
addThree(1)
## [1] 3
addThree(1, 2)
## [1] 4
addThree(1, c = 4)
## [1] 6
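
Named arguments can also be given in any order, because R matches arguments by name first and fills the remaining ones by position. A small sketch reusing addThree (the values are arbitrary):

```r
# Redefine addThree from above so this snippet runs on its own
addThree <- function(a, b = 1, c = 1) {
  a + b + c
}

# c and a are matched by name; b keeps its default of 1
r1 <- addThree(c = 4, a = 1)
r1
## [1] 6

# b is matched by name; 1 and 2 then fill a and c by position
r2 <- addThree(b = 10, 1, 2)
r2
## [1] 13
```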

Matters of Style (and style matters!)

I won't strictly enforce one style, but I do ask that you follow two rules:

  • ALWAYS be consistent.
  • Keep each line of code to 80 characters or fewer.
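
One way to check the second rule is to measure each line with nchar. The sketch below is only an illustration: longLines is a hypothetical helper, not part of R, and in practice you might pass it the result of readLines on your own script.

```r
# Hypothetical helper: returns the indices of lines longer than `limit`
longLines <- function(lines, limit = 80) {
  which(nchar(lines) > limit)
}

# Three sample lines; only the second is longer than 80 characters
sampleLines <- c(
  "x <- 1",
  paste(rep("a", 81), collapse = ""),
  "y <- 2"
)
longLines(sampleLines)
## [1] 2
```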

In R, arguments to functions are separated by commas, so we might create a named vector like:

myVar <- c(a = 10, b = 100, c = 12, d = 14, exz = 142, f = 1293, g = 0, h = NA)

The same vector can also be written with one argument per line, with a comma at the end of each:

myVar <- c(a    = 10, 
           b    = 100, 
           c    = 12, 
           d    = 14, 
           exz  = 142, 
           f    = 1293, 
           g    = 0, 
           h    = NA)

In the comma-first style, you … put the comma first:

myVar <- c( a    = 10
          , b    = 100
          , c    = 12
          , d    = 14
          , exz  = 142
          , f    = 1293
          , g    = 0
          , h    = NA
          )

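With trailing commas, a missing comma is easy to overlook. This version is missing the comma after d = 14:

myVar <- c(a    = 10,
           b    = 100,
           c    = 12,
           d    = 14
           exz  = 142,
           f    = 1293,
           g    = 0,
           h    = NA)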

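In the comma-first style, a missing comma breaks the column of commas and is easier to spot. This version is missing the comma before f:

myVar <- c( a    = 10
          , b    = 100
          , c    = 12
          , d    = 14
          , exz  = 142
            f    = 1293
          , g    = 0
          , h    = NA
          )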
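
The last two versions above each leave out one comma. Whichever style you use, a missing comma is a syntax error, and R refuses to parse the whole expression. A small sketch, using parse on a shortened version of the vector (the tryCatch wrapper is only there so the snippet reports the failure instead of stopping):

```r
# The comma between `a = 10` and `b = 100` is missing
broken <- "myVar <- c(a = 10 b = 100)"

# parse() raises an error on invalid syntax; record success as TRUE/FALSE
parsedOk <- tryCatch({
  parse(text = broken)
  TRUE
}, error = function(e) FALSE)
parsedOk
## [1] FALSE
```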

Step 1: transcribe and translate

A transcribe function

Here's a template for our transcribe function:

# Transcribes DNA sequence into RNA equivalent
#
# dna   A character string containing DNA sequence
#
# value A character string containing RNA sequence
#
transcribe <- function(dna) {
  
}
# Enter your code here!

See how we've used the comment lines to clearly describe:

  • What our function does (first line)
  • What kinds of data we expect for the arguments (dna)
  • What kinds of data the function produces (value)

We'll write descriptions like this for all of the functions we include in our analysis files.

Here are the suggested building blocks for your transcribe function:

  • gsub
  • toupper or tolower
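
Both building blocks operate on whole character strings. A quick sketch on an unrelated string, so that the exercise itself is left to you:

```r
# gsub replaces every match of a pattern within a string
replaced <- gsub("a", "o", "banana")
replaced
## [1] "bonono"

# toupper and tolower change the case of every letter
upper <- toupper("banana")
upper
## [1] "BANANA"
lower <- tolower("BaNaNa")
lower
## [1] "banana"
```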

Now let's test your transcribe function with some sample input:

transcribe("ATGCTTATCTA")
## [1] "AUGCUUAUCUA"
transcribe("atgcttatcta")
## [1] "AUGCUUAUCUA"

If you got different results, go back and try to fix your function.

A translate function

Now let's write a translate function that follows this template:

# Translates RNA sequence into amino acid sequence
#
# rna   A character string containing RNA sequence
#
# value A character string containing amino acid sequence
#
translate <- function(rna) {
  
  # Read the genetic code into a data.frame from "data/codons.txt"
  
  # Split the `rna` string into a vector of codons
  
  # Return the amino acid sequence as a string
  
}
# Enter your code here!

Consider the difference between:

a <- c("ATG","CTT","ATC")
b <- "ATGCTTATC"
a == b
## [1] FALSE FALSE FALSE

Make sure you understand why these two vectors are different before moving on!
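
The length and nchar functions make the difference concrete. A short sketch using the same a and b:

```r
a <- c("ATG", "CTT", "ATC")
b <- "ATGCTTATC"

# a holds three strings; b holds a single string
length(a)
## [1] 3
length(b)
## [1] 1

# nchar counts the characters in each element
nchar(a)
## [1] 3 3 3
nchar(b)
## [1] 9

# Pasting the elements of a together gives one string equal to b
joined <- paste(a, collapse = "")
joined == b
## [1] TRUE
```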

Here are the suggested building blocks for your translation function:

  • read.table (note the stringsAsFactors and row.names arguments)
  • sapply
  • substr
  • seq
  • paste (note the collapse argument)
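
Here is a quick look at several of these building blocks in isolation, on toy data unrelated to codons. The inline table passed to read.table through its text argument is made up for illustration; the layout of data/codons.txt may well differ.

```r
# seq generates a regular sequence: from 1 to 10 in steps of 3
starts <- seq(1, 10, by = 3)
starts
## [1]  1  4  7 10

# substr extracts the characters between a start and a stop position
piece <- substr("abcdefgh", 3, 5)
piece
## [1] "cde"

# paste with collapse joins the elements of a vector into one string
joined <- paste(c("x", "y", "z"), collapse = "")
joined
## [1] "xyz"

# read.table with row.names = 1 uses the first column as row names, which
# allows lookups by name; stringsAsFactors = FALSE keeps text as character
toy <- read.table(text = "key value\na one\nb two",
                  header = TRUE, row.names = 1, stringsAsFactors = FALSE)
toy["b", "value"]
## [1] "two"
```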

Hopefully you're becoming comfortable with vectorized operations in R. For example:

a <- c(1, 2, 3, 4)
a + 1
## [1] 2 3 4 5

The sapply function (for simple apply) takes a vector and a function, and applies the function to each element:

sapply( a, function(n) { n + 1 } )
## [1] 2 3 4 5
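
One detail worth knowing before applying sapply to strings: when the input is a character vector, sapply by default names each result after the corresponding input element. A small sketch (the words are arbitrary):

```r
words <- c("apple", "fig")

# By default the results are named after the input strings
sapply(words, nchar)
## apple   fig 
##     5     3 

# USE.NAMES = FALSE returns a plain, unnamed vector
counts <- sapply(words, nchar, USE.NAMES = FALSE)
counts
## [1] 5 3
```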
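
One way to put these building blocks together:

  • Use seq to generate the codon start positions in rna
  • Use the codon start (and stop) positions with substr to extract each codon sequence
  • Use sapply and a custom function to run substr for each codon

Now let's test your translate function with some sample input:
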
translate("AUG")
## [1] "M"
translate("UUCUAAAUUAACAAAAUC")
## [1] "FSTOPINKI"
translate("UUCUAAAUUAACAAAAU")
## [1] "FSTOPINK"
translate("UUCUAAAUUAACAAAA")
## [1] "FSTOPINK"

(Hint: trailing bases that do not form a complete codon are ignored. Check out the nchar function along with the modulo operator %%.)
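
The idea behind the hint can be sketched on an arbitrary string: nchar gives the length, and %% gives the remainder after dividing by 3, that is, how many trailing characters do not fill a complete group of three.

```r
s <- "abcdefgh"

# Total number of characters
n <- nchar(s)
n
## [1] 8

# Remainder after division by 3: characters left over at the end
leftover <- n %% 3
leftover
## [1] 2

# Length of the part made up of complete groups of three
usable <- n - leftover
usable
## [1] 6
```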

When your function is working on all of the inputs above, you're ready to move on to step 2!