Step 2: Writing an ORF finder

Function template

Let's start off by adding a template for the findORF function to our genomeAnnotation.r script.

# Returns a data.frame describing the ORFs found in the given aminoacid sequence
#
# aminoacids  A character string containing aminoacid sequence
# cutoff      An integer giving the minimum aminoacid length of valid ORFs
#
# value       A data.frame, coordinates refer to amino acid position
#
findORF <- function(aminoacids, cutoff = 100) {

}

Regular expressions

In this case the pattern we need to find is:

M...X

Our regular expression pattern is going to be the string:

"M(.*?)X"

Here's what's going on in the pattern:

  • The "M" at the beginning says 'look for patterns that start with M'
  • The "X" at the end says '…and end with X'
  • The "(…)" adds a subpattern that we'll search for between the M and X
  • "." says 'find any character that isn't a space'
  • The "*" says 'find between 0 and an infinite number of those non-space characters'
  • The "?" says 'but make the stuff inside of the "(…)" as small as possible, while still conforming to my other rules'.

sampleAminoAcids <- "aaaaMbbbbXccccMddddXeeee"

Let's give gregexpr a whirl with our sample data and our regular expression:

gregexpr( pattern = "M(.*?)X", text = sampleAminoAcids )
## [[1]]
## [1]  5 15
## attr(,"match.length")
## [1] 6 6
## attr(,"useBytes")
## [1] TRUE

Try the following:

matches         <- gregexpr(pattern = "M(.*?)X", text = sampleAminoAcids)
startPositions  <- as.vector(matches[[1]])
lengths         <- attr(matches[[1]], "match.length")

startPositions
## [1]  5 15
lengths
## [1] 6 6

orfSequence <- substring( sampleAminoAcids
                        , first = startPositions
                        , last  = startPositions + lengths - 1
                        )

What does orfSequence hold now? Why do we need the "-1" in the last argument?

Creating a data.frame

data.frame( variable1 = c(1,2,3,4)
          , variable2 = c("a","b","c","d")
          )
##   variable1 variable2
## 1         1         a
## 2         2         b
## 3         3         c
## 4         4         d

So how can we store our ORF information in a convenient tabular format? With something like this:

orfTable <- data.frame( startPosition = startPositions
                      , length        = lengths
                      , aminoacids    = orfSequence
                      )

Write your findORF function

# Enter your code here!

Let's test our implementation to make sure it works on contrived inputs:

# This should fail
findORF("aaaaMbbbbXccccMddddXeeee")
## [1] aa.start  aa.length aa.seq   
## <0 rows> (or 0-length row.names)
# This should work
findORF("aaaaMbbbbXccccMddddXeeee", cutoff = 5)
##   aa.start aa.length aa.seq
## 1        5         6 MbbbbX
## 2       15         6 MddddX
# So should this
many <- paste( rep("aaaaMbbbbXccccMddddXeeee", times = 100), collapse = "ffff" )
findORF(many, cutoff = 5)
##     aa.start aa.length aa.seq
## 1          5         6 MbbbbX
## 2         15         6 MddddX
## 3         33         6 MbbbbX
## 4         43         6 MddddX
## 5         61         6 MbbbbX
## 6         71         6 MddddX
## 7         89         6 MbbbbX
## 8         99         6 MddddX
## 9        117         6 MbbbbX
## 10       127         6 MddddX
## 11       145         6 MbbbbX
## 12       155         6 MddddX
## 13       173         6 MbbbbX
## 14       183         6 MddddX
## 15       201         6 MbbbbX
## 16       211         6 MddddX
## 17       229         6 MbbbbX
## 18       239         6 MddddX
## 19       257         6 MbbbbX
## 20       267         6 MddddX
## 21       285         6 MbbbbX
## 22       295         6 MddddX
## 23       313         6 MbbbbX
## 24       323         6 MddddX
## 25       341         6 MbbbbX
## 26       351         6 MddddX
## 27       369         6 MbbbbX
## 28       379         6 MddddX
## 29       397         6 MbbbbX
## 30       407         6 MddddX
## 31       425         6 MbbbbX
## 32       435         6 MddddX
## 33       453         6 MbbbbX
## 34       463         6 MddddX
## 35       481         6 MbbbbX
## 36       491         6 MddddX
## 37       509         6 MbbbbX
## 38       519         6 MddddX
## 39       537         6 MbbbbX
## 40       547         6 MddddX
## 41       565         6 MbbbbX
## 42       575         6 MddddX
## 43       593         6 MbbbbX
## 44       603         6 MddddX
## 45       621         6 MbbbbX
## 46       631         6 MddddX
## 47       649         6 MbbbbX
## 48       659         6 MddddX
## 49       677         6 MbbbbX
## 50       687         6 MddddX
## 51       705         6 MbbbbX
## 52       715         6 MddddX
## 53       733         6 MbbbbX
## 54       743         6 MddddX
## 55       761         6 MbbbbX
## 56       771         6 MddddX
## 57       789         6 MbbbbX
## 58       799         6 MddddX
## 59       817         6 MbbbbX
## 60       827         6 MddddX
## 61       845         6 MbbbbX
## 62       855         6 MddddX
## 63       873         6 MbbbbX
## 64       883         6 MddddX
## 65       901         6 MbbbbX
## 66       911         6 MddddX
## 67       929         6 MbbbbX
## 68       939         6 MddddX
## 69       957         6 MbbbbX
## 70       967         6 MddddX
## 71       985         6 MbbbbX
## 72       995         6 MddddX
## 73      1013         6 MbbbbX
## 74      1023         6 MddddX
## 75      1041         6 MbbbbX
## 76      1051         6 MddddX
## 77      1069         6 MbbbbX
## 78      1079         6 MddddX
## 79      1097         6 MbbbbX
## 80      1107         6 MddddX
## 81      1125         6 MbbbbX
## 82      1135         6 MddddX
## 83      1153         6 MbbbbX
## 84      1163         6 MddddX
## 85      1181         6 MbbbbX
## 86      1191         6 MddddX
## 87      1209         6 MbbbbX
## 88      1219         6 MddddX
## 89      1237         6 MbbbbX
## 90      1247         6 MddddX
## 91      1265         6 MbbbbX
## 92      1275         6 MddddX
## 93      1293         6 MbbbbX
## 94      1303         6 MddddX
## 95      1321         6 MbbbbX
## 96      1331         6 MddddX
## 97      1349         6 MbbbbX
## 98      1359         6 MddddX
## 99      1377         6 MbbbbX
## 100     1387         6 MddddX
## 101     1405         6 MbbbbX
## 102     1415         6 MddddX
## 103     1433         6 MbbbbX
## 104     1443         6 MddddX
## 105     1461         6 MbbbbX
## 106     1471         6 MddddX
## 107     1489         6 MbbbbX
## 108     1499         6 MddddX
## 109     1517         6 MbbbbX
## 110     1527         6 MddddX
## 111     1545         6 MbbbbX
## 112     1555         6 MddddX
## 113     1573         6 MbbbbX
## 114     1583         6 MddddX
## 115     1601         6 MbbbbX
## 116     1611         6 MddddX
## 117     1629         6 MbbbbX
## 118     1639         6 MddddX
## 119     1657         6 MbbbbX
## 120     1667         6 MddddX
## 121     1685         6 MbbbbX
## 122     1695         6 MddddX
## 123     1713         6 MbbbbX
## 124     1723         6 MddddX
## 125     1741         6 MbbbbX
## 126     1751         6 MddddX
## 127     1769         6 MbbbbX
## 128     1779         6 MddddX
## 129     1797         6 MbbbbX
## 130     1807         6 MddddX
## 131     1825         6 MbbbbX
## 132     1835         6 MddddX
## 133     1853         6 MbbbbX
## 134     1863         6 MddddX
## 135     1881         6 MbbbbX
## 136     1891         6 MddddX
## 137     1909         6 MbbbbX
## 138     1919         6 MddddX
## 139     1937         6 MbbbbX
## 140     1947         6 MddddX
## 141     1965         6 MbbbbX
## 142     1975         6 MddddX
## 143     1993         6 MbbbbX
## 144     2003         6 MddddX
## 145     2021         6 MbbbbX
## 146     2031         6 MddddX
## 147     2049         6 MbbbbX
## 148     2059         6 MddddX
## 149     2077         6 MbbbbX
## 150     2087         6 MddddX
## 151     2105         6 MbbbbX
## 152     2115         6 MddddX
## 153     2133         6 MbbbbX
## 154     2143         6 MddddX
## 155     2161         6 MbbbbX
## 156     2171         6 MddddX
## 157     2189         6 MbbbbX
## 158     2199         6 MddddX
## 159     2217         6 MbbbbX
## 160     2227         6 MddddX
## 161     2245         6 MbbbbX
## 162     2255         6 MddddX
## 163     2273         6 MbbbbX
## 164     2283         6 MddddX
## 165     2301         6 MbbbbX
## 166     2311         6 MddddX
## 167     2329         6 MbbbbX
## 168     2339         6 MddddX
## 169     2357         6 MbbbbX
## 170     2367         6 MddddX
## 171     2385         6 MbbbbX
## 172     2395         6 MddddX
## 173     2413         6 MbbbbX
## 174     2423         6 MddddX
## 175     2441         6 MbbbbX
## 176     2451         6 MddddX
## 177     2469         6 MbbbbX
## 178     2479         6 MddddX
## 179     2497         6 MbbbbX
## 180     2507         6 MddddX
## 181     2525         6 MbbbbX
## 182     2535         6 MddddX
## 183     2553         6 MbbbbX
## 184     2563         6 MddddX
## 185     2581         6 MbbbbX
## 186     2591         6 MddddX
## 187     2609         6 MbbbbX
## 188     2619         6 MddddX
## 189     2637         6 MbbbbX
## 190     2647         6 MddddX
## 191     2665         6 MbbbbX
## 192     2675         6 MddddX
## 193     2693         6 MbbbbX
## 194     2703         6 MddddX
## 195     2721         6 MbbbbX
## 196     2731         6 MddddX
## 197     2749         6 MbbbbX
## 198     2759         6 MddddX
## 199     2777         6 MbbbbX
## 200     2787         6 MddddX