Step 2: Writing an ORF finder

Function template

With our transcribe and translate functions done, we now have the tools in place that we’ll need to write a function that finds ORFs. This function will take a string containing a sequence of amino acids and find all of the possible ORFs in that string. The logic of this function will be to identify stretches of amino acids that start with a methionine (“M”) and end with one of our stop codons (“X”). This function should also accept a ‘cutoff’ argument which will be used to filter ORFs that are too short to be believable.

Let’s start off by adding a template for the findORF function to our genomeAnnotation.r script.

# Returns a data.frame describing the ORFs found in the given aminoacid sequence
#
# aminoacids  A character string containing aminoacid sequence
# cutoff      An integer giving the minimum aminoacid length of valid ORFs
#
# value       A data.frame, coordinates refer to amino acid position
#
findORF <- function(aminoacids, cutoff = 100) {

}

So what’s the first thing this function is going to need to do? We want it to find stretches of amino acids that look like they correspond to real open reading frames. What’s the best way to look for small patterns in larger strings? The answer is regular expressions.

Regular expressions

Regular expressions are a powerful tool available in many programming languages which are designed to find patterns in text. Regular expressions are a mini-programming language all of their own which allows us to describe many different types of complex patterns.

In this case the pattern we need to find is:

M...X

Where “M” is a methionine, “X” is one of our stop codons, and “…” represents a variable number of amino acids between M and X. Furthermore, we want to find ALL of the non-overlapping instances of the “M…X” pattern in our input amino acid sequence.

You can read about the full power of the regular expression syntax in the R help, but for today I’ll guide you through how to write them.

Our regular expression pattern is going to be the string:

"M(.*?)X"

Here’s what’s going on in the pattern:

  • The “M” at the beginning says ‘look for patterns that start with M’
  • The “X” at the end says ‘…and end with X’
  • The “(…)” adds a subpattern that we’ll search for between the M and X
  • “.” says ‘find any character that isn’t a space’
  • The “*" says ‘find between 0 and an infinite number of those non-space characters’
  • The “?” says ‘but make the stuff inside of the “(…)” as small as possible, while still conforming to my other rules’.

If we didn’t add the “?” this pattern would match starting at the first M in our sequence – then gobble up all other M’s and X’s – until it hit the last X in the sequence. Obviously, not what we want!

All of the functions listed on the help page for ?sub can take regular expressions as their search patterns. Today we’ll use gregexpr (global regular expression) function.

When trying to learn how a new function works, it’s often useful to make up some example data that will make it easy to discover what a function is doing. Let’s make a sample “amino acid” sequence that will be easy to keep track of.

sampleAminoAcids <- "aaaaMbbbbXccccMddddXeeee"

See what I did there? If everything is working correctly we should find “MbbbbX” and “MddddX” and nothing else.

Let’s give gregexpr a whirl with our sample data and our regular expression:

gregexpr( pattern = "M(.*?)X", text = sampleAminoAcids )
## [[1]]
## [1]  5 15
## attr(,"match.length")
## [1] 6 6
## attr(,"useBytes")
## [1] TRUE

There’s a lot going with the output from this function – we’re starting to get a peak at some advanced R features. But notice that line that has the vector 5, 15? That suspiciously similar to the positions of our “M”’s in the sample string. See the line that has the vector 6, 6? Rather interesting, since both of our sample ORF sequences should be 6 amino acids long! As a novice user, you should be thinking to yourself “I think this is going to work for us”!

The structure being returned by gregexpr is called a list. It’s kind of like a basket that can hold other smaller types of data, like vectors.

Try the following:

matches         <- gregexpr(pattern = "M(.*?)X", text = sampleAminoAcids)
startPositions  <- as.vector(matches[[1]])
lengths         <- attr(matches[[1]], "match.length")

startPositions
## [1]  5 15
lengths
## [1] 6 6

Let’s also use our knowledge of substring (note the differences between substr and substring) to get the actual amino acid sequence:

orfSequence <- substring( sampleAminoAcids
                        , first = startPositions
                        , last  = startPositions + lengths - 1
                        )

What does orfSequence hold now? Why do we need the “-1” in the last argument?

So now we know how to write code that will match all of the possible ORFs in a sequence of amino acids! Now let’s think about how we want to collect our ORF information in a user friendly way.

Creating a data.frame

We’ve used data.frame’s before by loading tabular data in from files, but you can also create them de novo from with in an R session. The syntax is:

data.frame( variable1 = c(1,2,3,4)
          , variable2 = c("a","b","c","d")
          )
##   variable1 variable2
## 1         1         a
## 2         2         b
## 3         3         c
## 4         4         d

See what happened there? Each argument will become a column heading in your data.frame, and its value will become the data in that column. You’ll want to make sure your value vectors are all the same lengths or you’ll get an error.

So how can we store our ORF information in a convenient tabular format? With something like this:

orfTable <- data.frame( startPosition = startPositions
                      , length        = lengths
                      , aminoacids    = orfSequence
                      )

Take a look at the contents of orfTable. It’s time to take all of the parts we’ve just looked at and assemble them back into a whole.

Once you’ve created a data.frame in your code you can easily write it out to a file with the write.table function.

Write your findORF function

Return to the template for your findOrf function above and fill in the function body. You’ll need to use all of the parts we’ve just covered to implement this function.

Note that one thing we promised that this function would do that we haven’t discussed yet is filtering out ORFs that are below the size given by the cutoff argument. You will probably need to use your knowledge of indexing ([]) to filter the final version version of your orfTable based on the values in the length column.

# Enter your code here!

Let’s test our implementation to make sure it works on contrived inputs:

# This should fail
findORF("aaaaMbbbbXccccMddddXeeee")
## [1] aa.start  aa.length aa.seq   
## <0 rows> (or 0-length row.names)
# This should work
findORF("aaaaMbbbbXccccMddddXeeee", cutoff = 5)
##   aa.start aa.length aa.seq
## 1        5         6 MbbbbX
## 2       15         6 MddddX
# So should this
many <- paste( rep("aaaaMbbbbXccccMddddXeeee", times = 100), collapse = "ffff" )
findORF(many, cutoff = 5)
##     aa.start aa.length aa.seq
## 1          5         6 MbbbbX
## 2         15         6 MddddX
## 3         33         6 MbbbbX
## 4         43         6 MddddX
## 5         61         6 MbbbbX
## 6         71         6 MddddX
## 7         89         6 MbbbbX
## 8         99         6 MddddX
## 9        117         6 MbbbbX
## 10       127         6 MddddX
## 11       145         6 MbbbbX
## 12       155         6 MddddX
## 13       173         6 MbbbbX
## 14       183         6 MddddX
## 15       201         6 MbbbbX
## 16       211         6 MddddX
## 17       229         6 MbbbbX
## 18       239         6 MddddX
## 19       257         6 MbbbbX
## 20       267         6 MddddX
## 21       285         6 MbbbbX
## 22       295         6 MddddX
## 23       313         6 MbbbbX
## 24       323         6 MddddX
## 25       341         6 MbbbbX
## 26       351         6 MddddX
## 27       369         6 MbbbbX
## 28       379         6 MddddX
## 29       397         6 MbbbbX
## 30       407         6 MddddX
## 31       425         6 MbbbbX
## 32       435         6 MddddX
## 33       453         6 MbbbbX
## 34       463         6 MddddX
## 35       481         6 MbbbbX
## 36       491         6 MddddX
## 37       509         6 MbbbbX
## 38       519         6 MddddX
## 39       537         6 MbbbbX
## 40       547         6 MddddX
## 41       565         6 MbbbbX
## 42       575         6 MddddX
## 43       593         6 MbbbbX
## 44       603         6 MddddX
## 45       621         6 MbbbbX
## 46       631         6 MddddX
## 47       649         6 MbbbbX
## 48       659         6 MddddX
## 49       677         6 MbbbbX
## 50       687         6 MddddX
## 51       705         6 MbbbbX
## 52       715         6 MddddX
## 53       733         6 MbbbbX
## 54       743         6 MddddX
## 55       761         6 MbbbbX
## 56       771         6 MddddX
## 57       789         6 MbbbbX
## 58       799         6 MddddX
## 59       817         6 MbbbbX
## 60       827         6 MddddX
## 61       845         6 MbbbbX
## 62       855         6 MddddX
## 63       873         6 MbbbbX
## 64       883         6 MddddX
## 65       901         6 MbbbbX
## 66       911         6 MddddX
## 67       929         6 MbbbbX
## 68       939         6 MddddX
## 69       957         6 MbbbbX
## 70       967         6 MddddX
## 71       985         6 MbbbbX
## 72       995         6 MddddX
## 73      1013         6 MbbbbX
## 74      1023         6 MddddX
## 75      1041         6 MbbbbX
## 76      1051         6 MddddX
## 77      1069         6 MbbbbX
## 78      1079         6 MddddX
## 79      1097         6 MbbbbX
## 80      1107         6 MddddX
## 81      1125         6 MbbbbX
## 82      1135         6 MddddX
## 83      1153         6 MbbbbX
## 84      1163         6 MddddX
## 85      1181         6 MbbbbX
## 86      1191         6 MddddX
## 87      1209         6 MbbbbX
## 88      1219         6 MddddX
## 89      1237         6 MbbbbX
## 90      1247         6 MddddX
## 91      1265         6 MbbbbX
## 92      1275         6 MddddX
## 93      1293         6 MbbbbX
## 94      1303         6 MddddX
## 95      1321         6 MbbbbX
## 96      1331         6 MddddX
## 97      1349         6 MbbbbX
## 98      1359         6 MddddX
## 99      1377         6 MbbbbX
## 100     1387         6 MddddX
## 101     1405         6 MbbbbX
## 102     1415         6 MddddX
## 103     1433         6 MbbbbX
## 104     1443         6 MddddX
## 105     1461         6 MbbbbX
## 106     1471         6 MddddX
## 107     1489         6 MbbbbX
## 108     1499         6 MddddX
## 109     1517         6 MbbbbX
## 110     1527         6 MddddX
## 111     1545         6 MbbbbX
## 112     1555         6 MddddX
## 113     1573         6 MbbbbX
## 114     1583         6 MddddX
## 115     1601         6 MbbbbX
## 116     1611         6 MddddX
## 117     1629         6 MbbbbX
## 118     1639         6 MddddX
## 119     1657         6 MbbbbX
## 120     1667         6 MddddX
## 121     1685         6 MbbbbX
## 122     1695         6 MddddX
## 123     1713         6 MbbbbX
## 124     1723         6 MddddX
## 125     1741         6 MbbbbX
## 126     1751         6 MddddX
## 127     1769         6 MbbbbX
## 128     1779         6 MddddX
## 129     1797         6 MbbbbX
## 130     1807         6 MddddX
## 131     1825         6 MbbbbX
## 132     1835         6 MddddX
## 133     1853         6 MbbbbX
## 134     1863         6 MddddX
## 135     1881         6 MbbbbX
## 136     1891         6 MddddX
## 137     1909         6 MbbbbX
## 138     1919         6 MddddX
## 139     1937         6 MbbbbX
## 140     1947         6 MddddX
## 141     1965         6 MbbbbX
## 142     1975         6 MddddX
## 143     1993         6 MbbbbX
## 144     2003         6 MddddX
## 145     2021         6 MbbbbX
## 146     2031         6 MddddX
## 147     2049         6 MbbbbX
## 148     2059         6 MddddX
## 149     2077         6 MbbbbX
## 150     2087         6 MddddX
## 151     2105         6 MbbbbX
## 152     2115         6 MddddX
## 153     2133         6 MbbbbX
## 154     2143         6 MddddX
## 155     2161         6 MbbbbX
## 156     2171         6 MddddX
## 157     2189         6 MbbbbX
## 158     2199         6 MddddX
## 159     2217         6 MbbbbX
## 160     2227         6 MddddX
## 161     2245         6 MbbbbX
## 162     2255         6 MddddX
## 163     2273         6 MbbbbX
## 164     2283         6 MddddX
## 165     2301         6 MbbbbX
## 166     2311         6 MddddX
## 167     2329         6 MbbbbX
## 168     2339         6 MddddX
## 169     2357         6 MbbbbX
## 170     2367         6 MddddX
## 171     2385         6 MbbbbX
## 172     2395         6 MddddX
## 173     2413         6 MbbbbX
## 174     2423         6 MddddX
## 175     2441         6 MbbbbX
## 176     2451         6 MddddX
## 177     2469         6 MbbbbX
## 178     2479         6 MddddX
## 179     2497         6 MbbbbX
## 180     2507         6 MddddX
## 181     2525         6 MbbbbX
## 182     2535         6 MddddX
## 183     2553         6 MbbbbX
## 184     2563         6 MddddX
## 185     2581         6 MbbbbX
## 186     2591         6 MddddX
## 187     2609         6 MbbbbX
## 188     2619         6 MddddX
## 189     2637         6 MbbbbX
## 190     2647         6 MddddX
## 191     2665         6 MbbbbX
## 192     2675         6 MddddX
## 193     2693         6 MbbbbX
## 194     2703         6 MddddX
## 195     2721         6 MbbbbX
## 196     2731         6 MddddX
## 197     2749         6 MbbbbX
## 198     2759         6 MddddX
## 199     2777         6 MbbbbX
## 200     2787         6 MddddX