A practical introduction to working with data in R, including working with variables, functions and importing data. Everything you need to know how to do in R to avoid ever having to use Excel again!

The command line is the primary mechanism that you’ll use to interact with R. When you enter instructions the R interpreter will perform computations for you.

While this may seem like an arcane method for interacting with software, it has one *huge* advantage over point-and-click environments: it is incredibly easy to repeat or abstract computations that you need to do often or on very large data sets. Any instruction that you type at the R command line can also be saved to an “R script” file. These files are just plain text files (by convention, R scripts have a “.r” or “.R” at the end of the file name). Running an R script is equivalent to typing all of the commands in that script by hand at the command line.

Let’s try entering some simple math expressions to see how this interaction with the command line works. In the code examples you’ll see on this site the commands you type are followed by the text or visual output that R produces.

What do you think will happen when you enter these commands? Try it out.

```
1 + 2
2 * 3
4 ^ 5
6.7 / 8.9
```

As you might expect, R recognizes the standard syntax for numbers and mathematical operators. When you’ve entered a complete expression at the command line and hit “enter”, R evaluates the result of that expression. If you don’t tell R what to do with the result, it will just print out a representation of the value on the next line.

Manual pages for all of the functions in base R and R packages are built into the language. To quickly get help for any function you can use the `?` syntax.

To get the help page for the `c()` function:

```
## ?c
## help(c)
```

If you are working in RStudio you can search for help using the search box on the Help tab (lower right panel by default). You can also get help by putting your cursor over a function name in the editor or console and hitting F1.

Often, though, you will want to perform a calculation and save the result for later use. If you’ve used Excel, you’ve probably used cells on a worksheet to hold the results of calculations based on the data in other cells. In R we can save the results of computations in variables. To do this we’ll use the assignment operator, which is a little arrow in R: `<-`.

If you’re coming from another programming language and the arrow syntax bugs you, you *can* use the `=` operator for general assignment; see the R help pages for a few important edge cases.

```
a <- 10
a
```

`## [1] 10`

```
b <- a + 11
b
```

`## [1] 21`

```
c <- a / b
c
```

`## [1] 0.4762`
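One of the edge cases mentioned above for `=`, sketched here as an illustration rather than a full account: inside a function call, `=` only names an argument, while `<-` actually creates a variable. The wrapper `demo_assignment()` is a hypothetical helper, used only so the example does not depend on what is already in your workspace.

```r
demo_assignment <- function() {
  # `=` inside a call only names the argument; no variable x is created
  median(x = c(1, 2, 3))
  before <- exists("x", envir = environment(), inherits = FALSE)

  # `<-` inside a call assigns x, then passes the value to median()
  median(x <- c(1, 2, 3))
  after <- exists("x", envir = environment(), inherits = FALSE)

  # report whether x existed before and after the second call
  c(before = before, after = after)
}

demo_assignment()
```

If you stick to `<-` for assignment and `=` for function arguments, you are unlikely to run into this.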

If you’re working with very large numbers you can use scientific notation:

`2e10`

`## [1] 2e+10`

`2 * 10^10`

`## [1] 2e+10`

`2e10 == 2 * 10^10`

`## [1] TRUE`

In that last line we used the comparison operator `==`; it tests whether or not two values are equal.

Finally, let’s see what happens if we try hitting `enter` before we’ve finished entering a complete expression. Type `2 +` and then hit `enter`. You’ll see a little `+` show up as your prompt on the next line. This is R’s way of telling you the text you’ve entered so far isn’t a complete thought. Finish this expression by entering another number and hitting `enter` again.

In the above expressions we appeared to be doing computations on single numbers. In fact, something more complicated was going on under the hood. In R all data are actually vectors of data.

Unlike many other programming languages, R has no separate type for single (scalar) values; the most basic data structure is a vector. Single values are just vectors with one element. This may seem odd at first, until you consider the key implication: operators and most functions in R are built to handle vectors of data, not just single values. This means nearly anything that can be done with a single number can also be done with a vector of numbers. This is a language that was clearly designed by statisticians!

You can see how many elements a vector holds using the `length` function:

`length(10)`

`## [1] 1`

`length(c)`

`## [1] 1`

`length(1:10)`

`## [1] 10`

Above, `length` is the first example we’ve seen of an R function. In programming, functions are analogous to their mathematical counterparts: they take in one or more values and evaluate to a new value. The R syntax for running, or “calling”, a function is in the form of `functionName(value1, value2, ...)`.

The `length` function has reported that we’ve been working with vectors of 1 element so far. To **c**ompose a vector with more than one element, we’ll use the `c()` function:

`c(1,2,3,4)`

`## [1] 1 2 3 4`

```
d <- c(5,6,7,8)
d + 10
```

`## [1] 15 16 17 18`

`d + d`

`## [1] 10 12 14 16`

As you can see, mathematical operators in R are built to handle vectorized operations: we could add a 4 element vector `d` to a one element vector `10` and get a sensible result.
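Here are a few more vectorized operations as a quick sketch. The vector `d` is redefined so the example stands on its own; the other values are made up for illustration.

```r
d <- c(5, 6, 7, 8)

d * 2              # each element is doubled
sqrt(c(1, 4, 9))   # functions are applied element by element too
d + c(100, 200)    # the shorter vector is recycled to match the longer one
sum(d)             # some functions reduce a whole vector to a single value
```

The third line shows "recycling": when two vectors differ in length, R repeats the shorter one, which is also what happened with `d + 10`.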

**Side note for programmers**: if you’re coming to R with a background in other programming languages, you might have caught something in the last code block that freaked you out. We had previously assigned the variable `c` to hold a number, yet we were still able to call the built-in `c` function from this scope. In almost any other common scripting language this would not have worked. Although base scoping rules in R are strictly lexical, there are a number of aggressive additional checks that are performed on lookup or evaluation failures. For example, when an attempt is made to apply a non-function in an inner scope, the interpreter will ascend the scope chain looking for upvalues that *are* functions. This is why we were able to bind `c` to a vector of numbers in our current scope and also apply the base R `c()` function. Another example of this design philosophy at play in R is aggressive partial matching for named arguments in function calls.
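Both behaviours from the side note can be seen in a couple of lines. This is only a sketch: the value `0.5` is arbitrary, and `len` is simply an abbreviation that R expands to the `length.out` argument of `seq()`.

```r
c <- 0.5               # bind the name `c` to a number
v <- c(1, 2, 3)        # in call position, R still finds the base function c()
v

# Partial matching: `len` is matched to seq()'s `length.out` argument
seq(1, 10, len = 4)
```

Partial matching is convenient at the command line, but spelling argument names out in full tends to make scripts easier to read.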

So far we’ve just been working with numbers (`numeric` values in R lingo), but R also supports text (strings, which R calls `character` values) and true/false data (booleans, which R calls `logical` values).

To create strings, surround your text with either double `" ... "` or single `' ... '` quotes:

`"a"`

`## [1] "a"`

`"a" == 'a'`

`## [1] TRUE`

`c( "a", "b", "c", "d" )`

`## [1] "a" "b" "c" "d"`

You can create strings containing single quotes using double quotes and vice versa, but if you need to make a string that contains both single and double quotes you need to use the `\` “escape” character:

```
s <- "My data are \"awesome\"!"
cat(s)
```

`## My data are "awesome"!`

Two other special string characters are tab `\t` and newline `\n`:

```
s <- "a\tb\tc"
cat(s)
```

`## a b c`

```
s <- "a\nb\nc"
cat(s)
```

```
## a
## b
## c
```
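The examples above use `cat()` rather than just typing the variable name. A short sketch of why, using a made-up string:

```r
s <- "a\tb"

print(s)    # print() shows the string as you would type it, escapes included
cat(s)      # cat() writes the actual characters, so a real tab appears
nchar(s)    # the escape sequence \t counts as a single character
```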

To create boolean values use `TRUE` or `FALSE`:

`TRUE`

`## [1] TRUE`

`FALSE`

`## [1] FALSE`

`TRUE == FALSE`

`## [1] FALSE`

R has built-in support for flagging values as missing data. The special `NA` value can be mixed with any other kind of data in a vector. For example:

`c(1, 2, NA, 4)`

`## [1] 1 2 NA 4`

`c( "a", NA, "c", NA )`

`## [1] "a" NA "c" NA`

`c(TRUE, FALSE, NA, FALSE)`

`## [1] TRUE FALSE NA FALSE`

`is.na( c(1, NA) )`

`## [1] FALSE TRUE`

Most R functions will either understand how to deal with missing data, or issue an error if they involve a type of statistical analysis that can’t be used with missing data.
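Many summary functions take a third route: they return `NA` unless you tell them to drop the missing values first. A small sketch with made-up numbers:

```r
x <- c(1, 2, NA, 4)

mean(x)                 # NA: the mean of partly unknown data is unknown
mean(x, na.rm = TRUE)   # remove missing values before computing the mean
sum(is.na(x))           # count how many values are missing
```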

R also has a special “no-value” type called `NULL`. If you are coming to R from another programming language it is easy to confuse `NA` and `NULL` (for example, in Python data analysis modules the `None` type is often used to do double duty, signifying `NULL` or `NA` depending on the context). By convention, you should use `NA` in data structures to represent missing data points.

`NULL` signifies the absence of any value, for example a variable that has deliberately been left empty:

```
NULL
is.null(NULL)
```

`## [1] TRUE`
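One way to see the difference between the two is to look at how they behave inside a vector. A minimal sketch:

```r
length(NA)              # 1: NA is a placeholder that still occupies a slot
length(NULL)            # 0: NULL is the absence of any value

length(c(1, NA, 3))     # 3: the missing value keeps its position
length(c(1, NULL, 3))   # 2: NULL disappears when combined into a vector
```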

Let’s say we have a vector of numbers:

`myNumbers <- c( 10, 20, 30, 40, 50 )`

We can extract elements from 1D vectors using the index syntax `[]` and integers:

`myNumbers`

`## [1] 10 20 30 40 50`

`myNumbers[1]`

`## [1] 10`

`myNumbers[3]`

`## [1] 30`

Here, we’ve extracted the elements at the positions given by the integers we put inside of the `[...]`. Remember that the numbers `1` and `3` in the code above are actually one-element *vectors*.

We can use integer vectors with more than one element inside of our index `[...]`:

`myNumbers[ c(1, 3) ]`

`## [1] 10 30`

You can use the `:` operator to easily create a sequence of numbers:

`2:4`

`## [1] 2 3 4`

`myNumbers[2:4]`

`## [1] 20 30 40`

In addition to putting integer vectors inside of the index `[...]` we can also use logical vectors. If we do, a `TRUE` at a position causes a value to be extracted, while a `FALSE` indicates that it should be skipped. Let’s look at an example:

`myNumbers`

`## [1] 10 20 30 40 50`

`myNumbers[ c(FALSE, TRUE , TRUE , TRUE , TRUE ) ]`

`## [1] 20 30 40 50`

`myNumbers[ c(TRUE , FALSE, FALSE, FALSE, FALSE) ]`

`## [1] 10`

So why would you ever want to do this? The answer lies in the combination of indexing and the logical operators (`>`, `<`, `==`, `!=`, and `%in%`).

Logical operators always return a logical vector:

`myNumbers > 25`

`## [1] FALSE FALSE TRUE TRUE TRUE`

`myNumbers < 25`

`## [1] TRUE TRUE FALSE FALSE FALSE`

`myNumbers == 30`

`## [1] FALSE FALSE TRUE FALSE FALSE`

`myNumbers != 30`

`## [1] TRUE TRUE FALSE TRUE TRUE`

The `%in%` operator asks whether each value in the first vector can be found in the second:

`30 %in% myNumbers`

`## [1] TRUE`

`c(10, 100) %in% myNumbers`

`## [1] TRUE FALSE`

The `!` operator negates (flips) each value of a logical vector:

`!TRUE`

`## [1] FALSE`

`!(myNumbers > 25)`

`## [1] TRUE TRUE FALSE FALSE FALSE`

**A note about = vs ==:** Many beginners are confused by the difference between `=` and `==`. The `=` operator is used for value assignment, traditionally for arguments inside of function calls such as `plot(x = 10, y = 1)`, or in newer versions of R in place of the `<-` operator as in `a = 10`. If you want to compare two values, use the `==` operator. Comparisons evaluate to a logical vector (`TRUE` or `FALSE`).

So how can we combine logical comparisons with indexing?

`myNumbers[myNumbers > 25]`

`## [1] 30 40 50`

`myNumbers[myNumbers < 25]`

`## [1] 10 20`

You can get fancy…

`myNumbers[ (myNumbers %% 2) == 0 ]`

`## [1] 10 20 30 40 50`

What happened there? If you need help figuring it out, look up the `%%` (*modulo*) operator on the help panel.
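Every value in `myNumbers` happens to be even, so the filter kept all of them. The effect is easier to see on a vector with some odd values; `mixed` below is made up for this sketch.

```r
mixed <- c(3, 10, 15, 22)     # hypothetical values, some odd and some even

mixed %% 2                    # remainder after dividing each value by 2
mixed[(mixed %% 2) == 0]      # keep only the values with remainder 0 (even)
```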

Finally, the indexing `[...]` syntax isn’t just used to extract values from data structures. It can also be used to assign values *into* existing structures. For example:

`myNumbers`

`## [1] 10 20 30 40 50`

```
myNumbers[3] <- 100
myNumbers
```

`## [1] 10 20 100 40 50`

```
myNumbers[2:3] <- c(1,2)
myNumbers
```

`## [1] 10 1 2 40 50`
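Assignment works with logical indexes as well, which lets you replace every value that meets a condition in one step. A minimal sketch using a made-up vector called `scores`:

```r
scores <- c(10, 20, 30, 40, 50)   # hypothetical data

scores[scores > 25] <- 0          # replace every value above 25 with 0
scores
```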

As we saw in the previous section, vectors are the basic building blocks of all data in R and can hold numeric, string or boolean values. Vectors can in turn be composed into more complicated data structures including matrices, arrays, data frames and lists.

A matrix is a two-dimensional grid of values that all share the same type; you can think of it as a set of equal-length column vectors placed side by side:

```
m <- matrix(1:8, nrow = 2, ncol = 4)
m
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
```

You can access values in a matrix by using a one-element index, referring to the n-th position (counting down each column in turn):

`m[2]`

`## [1] 2`

Alternatively you can specify a `[row, col]`:

`m[1,2]`

`## [1] 3`

Or just a row:

`m[1,]`

`## [1] 1 3 5 7`

Or just a column:

`m[,2]`

`## [1] 3 4`

If you forget this syntax, just pay attention to how R prints out matrices!
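A short sketch tying the two indexing styles together, using the same matrix as above (redefined here so the example runs on its own):

```r
m <- matrix(1:8, nrow = 2, ncol = 4)

dim(m)      # number of rows, then number of columns
m[3]        # a single index counts down the columns: row 1 of column 2
m[1, 2]     # the same value, addressed as [row, col]
m[2, 4]     # row 2, column 4
```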

An array is a matrix of more than two dimensions.

`array(1:8, dim=c(2,2,2))`

```
## , , 1
##
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## , , 2
##
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
```

Unlike arrays and matrices, lists are collections of vectors where the individual vectors can hold different types of data:

```
l <- list( a = c(1, 2, 3, 4)
, b = c("a", "b", "c")
)
l
```

```
## $a
## [1] 1 2 3 4
##
## $b
## [1] "a" "b" "c"
```

You can access individual vectors on lists using indexing with numbers or names:

`l[1]`

```
## $a
## [1] 1 2 3 4
```

`l["a"]`

```
## $a
## [1] 1 2 3 4
```

Did you notice what type of thing was returned there?

To simplify the result of indexing down to a vector (rather than a one element list):

`l[[1]]`

`## [1] 1 2 3 4`

The `$` is shorthand for referencing a named element on a list:

`l$a`

`## [1] 1 2 3 4`
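If you want to confirm what each style of indexing returns, `class()` will tell you. A minimal sketch with the same list as above:

```r
l <- list(a = c(1, 2, 3, 4), b = c("a", "b", "c"))

class(l["a"])              # "list": single brackets return a smaller list
class(l[["a"]])            # "numeric": double brackets return the vector itself
identical(l[["a"]], l$a)   # `$` gives the same result as double brackets
```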

R has great built-in support for working with data in tabular format. Tables in R are called “data frames.” By convention, response and annotation variables are arranged across the columns and observations down the rows. Columns, and optionally rows, can also be given unique names.

Under the hood, data.frame structures are just a specialized kind of a list – where each component vector (column) is of the same length.
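You can check the list-like nature of a data frame yourself. The table `df` below is a made-up example, not one of the data sets used in this tutorial.

```r
df <- data.frame(day = c(1, 2, 3), response = c(5, 6, 7))  # hypothetical table

is.list(df)        # TRUE: a data frame is a list underneath
length(df)         # its length is the number of columns
nrow(df)           # every column has this many elements
df[["response"]]   # list-style indexing extracts a column as a vector
```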

Before we dive into loading and working with tabular data in R, it’s worth taking a moment to consider a key difference between data formatting expectations in advanced statistical software packages like R and the bad habits most folks develop after years of working in Excel. If you’re used to working in other stats packages like SAS, SPSS or Minitab, you can skip this section.

Let’s consider a simple example experimental design: a response variable measured in two different treatment groups (A, B) over a 4 day period.

Excel has probably trained you to format data something like this:

Day | Group A | Group B |
---|---|---|
1 | 5 | 5 |
2 | 6 | 7 |
3 | 7 | 9 |
4 | 8 | 11 |

The reason we have all learned to format the data this way in Excel is that it makes it easy to produce plots – if we select these cells and click the scatter plot wizard, we’ll get the desired plot with Day on the X-axis and two sets of points, one for Group A and the other Group B.

Statisticians, and by proxy statistical software packages, object to this formatting for an important reason. The problem is that we’ve mixed our concerns in designing the structure of these columns: (1) *two* different columns contain values for the *same* response being measured; (2) a second variable in this design (treatment) has to be inferred from the column headings.

The correct design would be a three column table:

Day | Group | Response |
---|---|---|
1 | A | 5 |
1 | B | 5 |
2 | A | 6 |
2 | B | 7 |
3 | A | 7 |
3 | B | 9 |
4 | A | 8 |
4 | B | 11 |

If you have a lot of data formatted in the first of these two formats, don’t worry. Restructuring your tables is easy to do in R, as is generating any new categorical label columns you might need. We can explore this topic in more detail if there’s interest, but as a preview the tools you’ll probably need are the `c(..., recursive = TRUE)` and `rep()` functions.
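As a rough sketch of that restructuring, here is the example table converted from the wide layout to the three column layout. The column names `GroupA` and `GroupB` and the variable names `wide` and `long` are assumptions made for this example, and plain `c()` is enough here because there are only two columns to stack.

```r
# The wide, Excel-style table from the example above
wide <- data.frame(Day    = 1:4,
                   GroupA = c(5, 6, 7, 8),
                   GroupB = c(5, 7, 9, 11))

# Stack the two response columns and build the matching label columns
long <- data.frame(Day      = rep(wide$Day, times = 2),
                   Group    = rep(c("A", "B"), each = nrow(wide)),
                   Response = c(wide$GroupA, wide$GroupB))

# Sort by day so the rows appear in the same order as the three column table
long[order(long$Day), ]
```

`rep(..., times = 2)` repeats the whole Day column twice, while `rep(..., each = 4)` repeats each group label four times, so every response value ends up next to the right day and group.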

R can import tabular data from a wide variety of source file formats. Base R has excellent support for loading data using the `read.table` family of functions. There is also a wide array of R packages that support loading data from databases and other binary file formats. If you just need to move data from an Excel worksheet into R, the easiest path is to save it as a text file (tab-delimited or csv) and load it into R using `read.table`.

If you’re working in RStudio, you can use the “Import Dataset” button on the Workspace tab to load data from a local file or over the web. Under the hood, RStudio is just calling `read.table` for you, which we’ll explore below (see the History tab to see the command that RStudio generated for you).

To follow along with this example, you can download the genetic code table (codons.txt) and save it in a `data` folder inside your current working directory, which is where the code below looks for it. This table has two columns: “codon” and “aminoAcid.” To load the table into a variable:

```
codons <- read.table( "data/codons.txt"
, header = TRUE
, stringsAsFactors = FALSE
)
head(codons)
```

```
##   codon aminoAcid
## 1   GCU         A
## 2   GCC         A
## 3   GCA         A
## 4   GCG         A
## 5   CGU         R
## 6   CGC         R
```

The `read.table` function takes a large number of optional arguments, which allow it to adapt to a wide variety of file formats. Here we’ve specified `header = TRUE` because the first line of our file contains column headings. The `stringsAsFactors = FALSE` argument tells R not to convert text columns to a special type of data structure called a `factor`. Factors are intended to flag strings as describing levels of a categorical variable. They are a more advanced topic than we’ll dive into here, so we’ll turn them off.
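If you don’t have the file handy, you can try the same arguments on a few lines of inline text using the `text` argument of `read.table`. The three data rows below are copied from the output above; the names `raw_text` and `mini` are made up for this sketch, and it assumes the file is whitespace-delimited, as the call above implies.

```r
# A few lines in the same layout as codons.txt, supplied inline
raw_text <- "codon aminoAcid
GCU A
GCC A
CGU R"

mini <- read.table(text = raw_text,
                   header = TRUE,              # first line holds column names
                   stringsAsFactors = FALSE)   # keep text columns as strings

mini
class(mini$codon)   # "character", because factors were turned off
```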

Once your data is loaded into a data.frame (table), you can access vectors of data for individual variables in the table using the `$` syntax:

`codons$codon`

```
## [1] "GCU" "GCC" "GCA" "GCG" "CGU" "CGC" "CGA" "CGG" "AGA" "AGG" "AAU"
## [12] "AAC" "GAU" "GAC" "UGU" "UGC" "CAA" "CAG" "GAA" "GAG" "GGU" "GGC"
## [23] "GGA" "GGG" "CAU" "CAC" "AUU" "AUC" "AUA" "AUG" "UUA" "UUG" "CUU"
## [34] "CUC" "CUA" "CUG" "AAA" "AAG" "UUU" "UUC" "CCU" "CCC" "CCA" "CCG"
## [45] "UCU" "UCC" "UCA" "UCG" "AGU" "AGC" "ACU" "ACC" "ACA" "ACG" "UGG"
## [56] "UAU" "UAC" "GUU" "GUC" "GUA" "GUG" "UAA" "UGA" "UAG"
```

`codons$aminoAcid`

```
## [1] "A" "A" "A" "A" "R" "R" "R" "R" "R" "R" "N" "N" "D" "D" "C" "C" "Q"
## [18] "Q" "E" "E" "G" "G" "G" "G" "H" "H" "I" "I" "I" "M" "L" "L" "L" "L"
## [35] "L" "L" "K" "K" "F" "F" "P" "P" "P" "P" "S" "S" "S" "S" "S" "S" "T"
## [52] "T" "T" "T" "W" "Y" "Y" "V" "V" "V" "V" "X" "X" "X"
```

If we want to access a single data point we can use indexing syntax with `[]`. When we are working with a 2D data structure, we can specify a `[row, col]`:

`codons[ 1, 2 ]`

`## [1] "A"`

`codons[ 2, 1 ]`

`## [1] "GCC"`

We can use the `$` syntax (like the `[...]`) to assign data to existing columns on a `data.frame` or to create a new column. Let’s say that we want to add a new column to our `codons` table that will annotate what type of amino acid is encoded by each codon (non-polar, polar, acidic or basic).

We can start by creating a new column called `type` that contains all `NA` values:

```
codons$type <- NA
head(codons)
```

```
##   codon aminoAcid type
## 1   GCU         A   NA
## 2   GCC         A   NA
## 3   GCA         A   NA
## 4   GCG         A   NA
## 5   CGU         R   NA
## 6   CGC         R   NA
```

Here we could have hand-encoded a vector of 20 strings describing the type of each amino acid in our table. But we’ll take the lazier path and learn a few new R tricks along the way. Let’s make some vectors that describe which amino acids belong to each of the four categories:

```
nonpolar <- c( "A", "C", "G", "I", "L", "M", "F", "P", "W", "V" )
polar <- c( "N", "Q", "S", "T", "Y" )
acidic <- c( "D", "E" )
basic <- c( "R", "H", "K" )
length( c( nonpolar, polar, acidic, basic ) ) == 20
```

`## [1] TRUE`

Now we can update our `type` column using the annotations that we’ve saved in these variables. The logical `%in%` operator tests whether or not each value in one vector (left side) is found in another (right side). It returns a vector of boolean values of the same length as the left-hand test vector. So we can do this:

Which rows contain nonpolar amino acids?

`codons$aminoAcid %in% nonpolar`

```
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [23] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [56] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
```

We can use these logical vectors to assign our type annotations:

```
codons[ codons$aminoAcid %in% nonpolar, "type" ] <- "nonpolar"
codons[ codons$aminoAcid %in% polar, "type" ] <- "polar"
codons[ codons$aminoAcid %in% acidic, "type" ] <- "acidic"
codons[ codons$aminoAcid %in% basic, "type" ] <- "basic"
```

Check to see if it worked!
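One possible way to check, sketched on a four-row stand-in for the table so that it runs without the data file (the name `mini` and the choice of rows are assumptions for this example):

```r
# A small stand-in for the codons table (four rows only)
mini <- data.frame(codon     = c("GCU", "CGU", "GAU", "UAA"),
                   aminoAcid = c("A", "R", "D", "X"),
                   stringsAsFactors = FALSE)

# Same annotation steps as above, on the stand-in table
mini$type <- NA
mini[mini$aminoAcid %in% c("A"), "type"] <- "nonpolar"
mini[mini$aminoAcid %in% c("R"), "type"] <- "basic"
mini[mini$aminoAcid %in% c("D"), "type"] <- "acidic"

# Count rows per type; useNA also reports rows that were never annotated
table(mini$type, useNA = "ifany")

# Show the rows that still have no type
mini[is.na(mini$type), ]
```

On the full `codons` table the same two checks should work unchanged. The three rows coded `X` are not in any of the four vectors, so you would expect them to remain `NA`.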

Pretty neat, eh? Working with numeric data in table columns is even more straightforward. In R, it’s very easy to create new columns that are calculated from existing data, as you might be used to doing in Excel. In R, however, adding complex annotation columns like `type` above is also very simple.

Notice how we accomplished this task by composing a few minimal data structures, followed by a few relatively straight-forward assignments. We never had to repeat assignment of any of our amino acid types. In both software design and data analysis we always try to adhere to the “[DRY](http://en.wikipedia.org/wiki/Don't_repeat_yourself)” (don’t repeat yourself) principle.

This is much easier to do when working with tabular data in R than it is in traditional spreadsheet packages or other statistical environments.

The challenge – design an R script with a set of functions that will:

- Read the DNA sequence in the file “data/npl3-dna.txt”
- Transcribe it into RNA sequence
- Save the results to a new file, rna.txt

And the special challenge:

- Translate the DNA sequence to protein (using the “data/codons.txt”)

Functions you may find useful:

- `readLines()` (*hint*: `con = "dna.txt"`)
- `gsub()` (*hint*: `fixed = TRUE`)
- `write()`