Working with simple data

Interacting with R

What do you think will happen when you enter these commands? Try it out.

1 + 2
2 * 3
4 ^ 5
6.7 / 8.9

Getting help

To get the help page for the c() function:

## ?c
## help(c)

Saving data in variables

a <- 10
a
## [1] 10
b <- a + 11
b
## [1] 21
c <- a / b
c
## [1] 0.4762

If you're working with very large numbers you can use scientific notation:

2e10
## [1] 2e+10
2 * 10^10
## [1] 2e+10
2e10 == 2 * 10^10
## [1] TRUE

Everything is a vector

You can see how many elements a vector holds using the length function:

length(10)
## [1] 1
length(c)
## [1] 1
length(1:10)
## [1] 10

Composing vectors

c(1,2,3,4)
## [1] 1 2 3 4
d <- c(5,6,7,8)
d + 10
## [1] 15 16 17 18
d + d
## [1] 10 12 14 16
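
Arithmetic on vectors works element by element. The short sketch below illustrates this, along with what happens when the two vectors differ in length (the shorter one is recycled). The vector e is a made-up example and not part of the original material:

```r
d <- c(5, 6, 7, 8)
e <- c(1, 2, 3, 4)   # hypothetical second vector, same length as d

# Element-wise multiplication: 5*1, 6*2, 7*3, 8*4
d * e
## [1]  5 12 21 32

# A shorter vector is recycled: 5+10, 6+20, 7+10, 8+20
d + c(10, 20)
## [1] 15 26 17 28
```

This is also why d + 10 works: the single value 10 is a vector of length one that gets recycled across every element of d.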

Strings

To create strings, surround your text with either double " ... " or single ' ... ' quotes:

"a"
## [1] "a"
"a" == 'a'
## [1] TRUE
c( "a", "b", "c", "d" )
## [1] "a" "b" "c" "d"

Escape characters

s <- "My data are \"awesome\"!"
cat(s)
## My data are "awesome"!

Two other special string characters are tab \t and newline \n:

s <- "a\tb\tc"
cat(s)
## a    b   c
s <- "a\nb\nc"
cat(s)
## a
## b
## c
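
A small illustration of why cat() is used above: print() shows the escape sequence as typed, while cat() writes the actual character. The sketch also shows that an escape such as \t counts as a single character:

```r
s <- "a\tb"

# print() displays the escaped form
print(s)
## [1] "a\tb"

# The string holds three characters: a, a tab, and b
nchar(s)
## [1] 3
```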

Boolean values

To create boolean values use TRUE or FALSE:

TRUE
## [1] TRUE
FALSE
## [1] FALSE
TRUE == FALSE
## [1] FALSE

Missing Data

c(1, 2, NA, 4)
## [1]  1  2 NA  4
c( "a", NA, "c", NA )
## [1] "a" NA  "c" NA
c(TRUE, FALSE, NA, FALSE)
## [1]  TRUE FALSE    NA FALSE
is.na( c(1, NA) )
## [1] FALSE  TRUE
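
Missing values propagate through most calculations, so it helps to know how to skip them. A minimal sketch, using a made-up vector x:

```r
x <- c(1, 2, NA, 4)   # hypothetical data with one missing value

# Any sum that includes NA is itself NA
sum(x)
## [1] NA

# Many functions accept na.rm = TRUE to drop missing values first
sum(x, na.rm = TRUE)
## [1] 7

# Keep only the elements that are not missing
x[!is.na(x)]
## [1] 1 2 4
```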

A note about NULL

NULL signifies the absence of a value, for example a variable that has not been given any data yet:

NULL
## NULL
is.null(NULL)
## [1] TRUE
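
NULL and NA are easy to confuse. The sketch below is one way to see the difference: NA is a placeholder that occupies a position in a vector, while NULL is nothing at all and disappears when combined:

```r
# NULL has no elements; NA is a vector of length one
length(NULL)
## [1] 0
length(NA)
## [1] 1

# NULL vanishes inside c(), NA keeps its slot
c(1, NULL, 3)
## [1] 1 3
c(1, NA, 3)
## [1]  1 NA  3
```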

Indexing syntax in R

Extracting values

Let's say we have a vector of numbers:

myNumbers <- c( 10, 20, 30, 40, 50 )

We can extract elements from 1D vectors using the index syntax [] and integers:

myNumbers
## [1] 10 20 30 40 50
myNumbers[1]
## [1] 10
myNumbers[3]
## [1] 30

We can use integer vectors with more than one element inside the index brackets [...]:

myNumbers[ c(1, 3) ]
## [1] 10 30

You can use the : operator to easily create a sequence of numbers:

2:4
## [1] 2 3 4
myNumbers[2:4]
## [1] 20 30 40

We can also index with a logical vector; only the elements at TRUE positions are returned:

myNumbers
## [1] 10 20 30 40 50
myNumbers[ c(FALSE, TRUE , TRUE , TRUE , TRUE ) ]
## [1] 20 30 40 50
myNumbers[ c(TRUE , FALSE, FALSE, FALSE, FALSE) ]
## [1] 10

Comparison operators always return a logical vector:

myNumbers > 25
## [1] FALSE FALSE  TRUE  TRUE  TRUE
myNumbers < 25
## [1]  TRUE  TRUE FALSE FALSE FALSE
myNumbers == 30
## [1] FALSE FALSE  TRUE FALSE FALSE
myNumbers != 30
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

The %in% operator asks whether each element of the first vector can be found in the second:

30 %in% myNumbers
## [1] TRUE
c(10, 100) %in% myNumbers
## [1]  TRUE FALSE

The ! operator negates (flips) each value of a logical vector:

!TRUE
## [1] FALSE
!(myNumbers > 25)
## [1]  TRUE  TRUE FALSE FALSE FALSE

So how can we combine logical comparisons with indexing?

myNumbers[myNumbers > 25]
## [1] 30 40 50
myNumbers[myNumbers < 25]
## [1] 10 20

You can get fancy…

myNumbers[ (myNumbers %% 2) == 0 ]
## [1] 10 20 30 40 50
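
Conditions can also be combined. The sketch below uses & (and) and | (or), which are not covered in the text above, plus which() to turn a logical vector into positions:

```r
myNumbers <- c(10, 20, 30, 40, 50)

# Both conditions must hold
myNumbers[myNumbers > 15 & myNumbers < 45]
## [1] 20 30 40

# Either condition may hold
myNumbers[myNumbers < 15 | myNumbers > 45]
## [1] 10 50

# which() reports the positions of the TRUE values
which(myNumbers > 25)
## [1] 3 4 5
```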

Assigning values

myNumbers
## [1] 10 20 30 40 50
myNumbers[3]    <- 100
myNumbers
## [1]  10  20 100  40  50
myNumbers[2:3]  <- c(1,2)
myNumbers
## [1] 10  1  2 40 50
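
Assignment works with logical indexes too, which is handy for replacing every value that meets a condition. A minimal sketch, starting again from the original vector (the threshold of 25 is an arbitrary choice for illustration):

```r
myNumbers <- c(10, 20, 30, 40, 50)

# Set every value greater than 25 to zero
myNumbers[myNumbers > 25] <- 0
myNumbers
## [1] 10 20  0  0  0
```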

Bigger data structures

Matrix

A matrix is a vector arranged in rows and columns: every column has the same length and every element has the same type of data:

m <- matrix(1:8, nrow = 2, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

You can access values in a matrix using a single index, which refers to the n-th position (counting down each column in turn):

m[2]
## [1] 2

Alternatively, you can specify a [row, column] pair:

m[1,2]
## [1] 3

Or just a row:

m[1,]
## [1] 1 3 5 7

Or just a column:

m[,2]
## [1] 3 4

If you forget this syntax, just pay attention to how R prints out matrices!
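
A few more ways to poke at the same matrix, as a sketch. The byrow = TRUE argument and the name byRow are additions for illustration; they show that values fill down the columns unless you ask otherwise:

```r
m <- matrix(1:8, nrow = 2, ncol = 4)

# Number of rows and columns
dim(m)
## [1] 2 4

# Single-index positions count down the columns: 1, 2, then 3 in column 2
m[3]
## [1] 3

# Several columns at once
m[, 2:3]
##      [,1] [,2]
## [1,]    3    5
## [2,]    4    6

# Fill across the rows instead of down the columns
byRow <- matrix(1:8, nrow = 2, byrow = TRUE)
byRow
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
```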

Array

An array extends the idea of a matrix to more than two dimensions.

array(1:8, dim=c(2,2,2))
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
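
Indexing an array follows the same pattern as a matrix, with one index per dimension. A short sketch, storing the array above in a variable named a (a name chosen here for illustration):

```r
a <- array(1:8, dim = c(2, 2, 2))

dim(a)
## [1] 2 2 2

# Row 1, column 2 of the second slice
a[1, 2, 2]
## [1] 7

# The whole first slice comes back as a matrix
a[, , 1]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```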

Lists

l <- list( a = c(1, 2, 3, 4)
         , b = c("a", "b", "c")
         )
l
## $a
## [1] 1 2 3 4
## 
## $b
## [1] "a" "b" "c"

You can access individual vectors in a list by indexing with numbers or names:

l[1]
## $a
## [1] 1 2 3 4
l["a"]
## $a
## [1] 1 2 3 4

Did you notice what type of thing was returned there?

To get the vector itself (rather than a one-element list), use double brackets [[ ]]:

l[[1]]
## [1] 1 2 3 4

The $ operator is shorthand for referencing a named element of a list:

l$a
## [1] 1 2 3 4
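
One way to answer the "what type of thing" question is to ask R directly with class(). The sketch below rebuilds the same list and compares single and double brackets:

```r
l <- list(a = c(1, 2, 3, 4), b = c("a", "b", "c"))

# Single brackets return a (smaller) list
class(l["a"])
## [1] "list"

# Double brackets return the element itself
class(l[["a"]])
## [1] "numeric"

# Indexes can be chained: second value of element b
l[["b"]][2]
## [1] "b"

# List the available names
names(l)
## [1] "a" "b"
```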

Working with Tables

A note about table structure

Excel has probably trained you to format data something like this:

Day   Group A   Group B
1     5         5
2     6         7
3     7         9
4     8         11

The correct design would be a three-column table, with one observation per row:

Day   Group   Response
1     A       5
1     B       5
2     A       6
2     B       7
3     A       7
3     B       9
4     A       8
4     B       11
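
If your data already arrived in the wide, Excel-style layout, it can be rearranged in R. The following is a minimal base R sketch, not the only way to do it; the names wide and long, and the column names A and B, are made up for this example:

```r
# The wide layout: one column per group
wide <- data.frame(Day = 1:4,
                   A   = c(5, 6, 7, 8),
                   B   = c(5, 7, 9, 11))

# Stack the group columns into a single Response column
long <- data.frame(
  Day      = rep(wide$Day, times = 2),
  Group    = rep(c("A", "B"), each = nrow(wide)),
  Response = c(wide$A, wide$B),
  stringsAsFactors = FALSE
)

# Sort by day, then group, to match the three-column table
long <- long[order(long$Day, long$Group), ]
rownames(long) <- NULL

nrow(long)
## [1] 8
long$Response
## [1]  5  5  6  7  7  9  8 11
```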

Loading tabular data

codons <- read.table( "data/codons.txt"
                    , header = TRUE
                    , stringsAsFactors = FALSE
                    )
head(codons)
##   codon aminoAcid
## 1   GCU         A
## 2   GCC         A
## 3   GCA         A
## 4   GCG         A
## 5   CGU         R
## 6   CGC         R
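
If you don't have the file at hand, you can try read.table() on a few lines of text instead. This sketch uses the text argument with a made-up three-row excerpt (sampleText and small are illustrative names) and then inspects what was loaded:

```r
# A hypothetical excerpt in the same whitespace-separated format
sampleText <- "codon aminoAcid
GCU A
GCC A
CGU R"

small <- read.table( text = sampleText
                   , header = TRUE
                   , stringsAsFactors = FALSE
                   )

# Rows and columns
dim(small)
## [1] 3 2

# stringsAsFactors = FALSE keeps text columns as plain strings
class(small$codon)
## [1] "character"
small$aminoAcid
## [1] "A" "A" "R"
```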

Accessing data in a data.frame

codons$codon
##  [1] "GCU" "GCC" "GCA" "GCG" "CGU" "CGC" "CGA" "CGG" "AGA" "AGG" "AAU"
## [12] "AAC" "GAU" "GAC" "UGU" "UGC" "CAA" "CAG" "GAA" "GAG" "GGU" "GGC"
## [23] "GGA" "GGG" "CAU" "CAC" "AUU" "AUC" "AUA" "AUG" "UUA" "UUG" "CUU"
## [34] "CUC" "CUA" "CUG" "AAA" "AAG" "UUU" "UUC" "CCU" "CCC" "CCA" "CCG"
## [45] "UCU" "UCC" "UCA" "UCG" "AGU" "AGC" "ACU" "ACC" "ACA" "ACG" "UGG"
## [56] "UAU" "UAC" "GUU" "GUC" "GUA" "GUG" "UAA" "UGA" "UAG"
codons$aminoAcid
##  [1] "A" "A" "A" "A" "R" "R" "R" "R" "R" "R" "N" "N" "D" "D" "C" "C" "Q"
## [18] "Q" "E" "E" "G" "G" "G" "G" "H" "H" "I" "I" "I" "M" "L" "L" "L" "L"
## [35] "L" "L" "K" "K" "F" "F" "P" "P" "P" "P" "S" "S" "S" "S" "S" "S" "T"
## [52] "T" "T" "T" "W" "Y" "Y" "V" "V" "V" "V" "X" "X" "X"
codons[ 1, 2 ]
## [1] "A"
codons[ 2, 1 ]
## [1] "GCC"
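
The logical indexing from earlier works on data.frame rows as well. A sketch on a made-up four-row table (named small here so it doesn't overwrite codons):

```r
# Hypothetical miniature version of the codons table
small <- data.frame( codon     = c("GCU", "GCC", "CGU", "CGC")
                   , aminoAcid = c("A", "A", "R", "R")
                   , stringsAsFactors = FALSE
                   )

# Rows where the amino acid is R, codon column only
small[ small$aminoAcid == "R", "codon" ]
## [1] "CGU" "CGC"

# Number of rows matching a condition
sum( small$aminoAcid == "A" )
## [1] 2
```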

Calculating a new column

We can start by creating a new column called type that contains all NA values:

codons$type <- NA
head(codons)
##   codon aminoAcid type
## 1   GCU         A   NA
## 2   GCC         A   NA
## 3   GCA         A   NA
## 4   GCG         A   NA
## 5   CGU         R   NA
## 6   CGC         R   NA

nonpolar  <- c( "A", "C", "G", "I", "L", "M", "F", "P", "W", "V" )
polar     <- c( "N", "Q", "S", "T", "Y"                          )
acidic    <- c( "D", "E"                                         )
basic     <- c( "R", "H", "K"                                    )
length( c( nonpolar, polar, acidic, basic ) ) == 20
## [1] TRUE

Which rows contain nonpolar amino acids?

codons$aminoAcid %in% nonpolar
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [23]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [56] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

We can use these logical vectors to assign our type annotations:

codons[ codons$aminoAcid %in% nonpolar, "type" ] <- "nonpolar"
codons[ codons$aminoAcid %in% polar,    "type" ] <- "polar"
codons[ codons$aminoAcid %in% acidic,   "type" ] <- "acidic"
codons[ codons$aminoAcid %in% basic,    "type" ] <- "basic"

Check to see if it worked!
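
One possible way to check, sketched on a made-up five-row table so it runs on its own (the one-letter class vectors are shortened for illustration). The same two final calls can be run on codons itself, where you should expect the stop codons (amino acid "X") to remain NA, since "X" is in none of the four vectors:

```r
# Hypothetical stand-in for the codons table
small <- data.frame( aminoAcid = c("A", "N", "D", "R", "X")
                   , stringsAsFactors = FALSE
                   )
small$type <- NA
small[ small$aminoAcid %in% c("A"), "type" ] <- "nonpolar"
small[ small$aminoAcid %in% c("N"), "type" ] <- "polar"
small[ small$aminoAcid %in% c("D"), "type" ] <- "acidic"
small[ small$aminoAcid %in% c("R"), "type" ] <- "basic"

# How many rows are still unannotated?
sum( is.na(small$type) )
## [1] 1

# Which amino acids do they belong to?
unique( small$aminoAcid[ is.na(small$type) ] )
## [1] "X"
```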

Homework exercise

Implement transcription and translation

The challenge – design an R script with a set of functions that will:

  • Read the DNA sequence in the file "data/npl3-dna.txt"
  • Transcribe it into RNA sequence
  • Save the results to a new file, rna.txt

And the special challenge:

  • Translate the DNA sequence to protein (using the "data/codons.txt")

Parts list:

  • readLines() hint con = "dna.txt"
  • gsub() hint fixed = TRUE
  • write()
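
To get a feel for the parts before assembling them, here is a sketch that uses each one on unrelated toy data (it is deliberately not a solution to the challenge). The strings and the temporary file are made up for illustration:

```r
x <- "a.b.c"

# Without fixed = TRUE the pattern is a regular expression,
# and "." matches every character
gsub(".", "-", x)
## [1] "-----"

# With fixed = TRUE the pattern is matched literally
gsub(".", "-", x, fixed = TRUE)
## [1] "a-b-c"

# write() saves a character vector one element per line;
# readLines() reads the lines back in
tmp <- tempfile(fileext = ".txt")
write(c("first line", "second line"), file = tmp)
readLines(con = tmp)
## [1] "first line"  "second line"
```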