Working with simple data

Interacting with R

The command line is the primary mechanism that you’ll use to interact with R. When you enter instructions the R interpreter will perform computations for you.

While this may seem like an arcane method for interacting with software, it has one huge advantage over point-and-click environments: it is incredibly easy to repeat or abstract computations that you need to do often or on very large data sets. Any instruction that you type at the R command line can also be saved to a “R script” file or an “R Markdown” document. These files are just plain text files (by convention, R scripts have a “.r” or “.R” at the end of the file name and markdown files end in “.rmd”). Running an R script or documents is identical to hand typing all of the commands in that script at the command line.

Let’s try entering some simple math expressions to see how this interaction with the command line works. In the code examples you see on this site the commands you type are followed by the text or visual output that R produces.

What will happen when you enter these commands? Try it out.

1 + 2
2 * 3
4 ^ 5
6.7 / 8.9

No surprises; R can be used as a calculator. R recognizes the standard syntax for numbers and mathematical operators. When you’ve enter a complete expression at the command line and hit “enter”, R evaluates the result of that expression. If you don’t tell R what to do with the result, it will just print out a representation of the value on the next line.

Saving data in variables

Usually you will want to perform a calculation and save the result for later use. If you’ve used Excel, you’ve probably used cells on a worksheet to hold the results of calculations based on the data in other cells. In R we can save the results of computations in variables. To do this we’ll use the assignment operator, which is a little arrow <-.

If you’re coming from another programming language and the arrow syntax bugs you, you can use = operator for general assignment – see the R help pages for a few important edge cases.

a <- 10
a
## [1] 10
b <- a + 11
b
## [1] 21
c <- a / b
c
## [1] 0.4761905

If you’re working with very large numbers you can use scientific notation:

2e10
## [1] 2e+10
2 * 10^10
## [1] 2e+10
2e10 == 2 * 10^10
## [1] TRUE

In that last line we used the comparison operator ==; it tests whether or not two values are equivalent.

Finally, let’s see what happens if we try hitting enter before we’ve finished entering a complete expression. Type 2 + and then hit enter. You’ll see a little + shows up as your prompt on the next line. This is R’s way of telling you the text you’ve entered so far isn’t a complete thought. Finish this expression by entering another number and hitting enter again.

Everything is a vector

In the above expressions we appeared to be doing computations on single numbers. In fact, something more complicated was going on under the hood. In R all data are actually vectors of values.

Unlike other programming languages, there are no scalar values in R; the most basic data structure is a vector. Single values are just vectors with one element. This may seem odd at first, until you consider the key implication: all operators and functions in R are built to handle vectors of data, not just single values. This means anything that can be done with a single number can also be done with a vector of numbers. This is a language that was clearly designed by statisticians!

You can see how many elements a vector holds using the length function:

length(10)
## [1] 1
length(c)
## [1] 1
length(1:10)
## [1] 10

Above length is the first example we’ve seen of an R function. In programming, functions are analogous to their mathematical counter parts: they take in one or more values and evaluate to a new value. The R syntax for running, or “calling”, a function is in the form of functionName(value1, value2, ...). We’ll explore functions in more depth on Monday.

Composing vectors

The length function has reported that we’ve been working with vectors of 1 element so far. To compose a vector with more than one element, we’ll use the c() function:

c(1,2,3,4)
## [1] 1 2 3 4
d <- c(5,6,7,8)
d + 10
## [1] 15 16 17 18
d + d
## [1] 10 12 14 16

As you can see, mathematical operators in R are built to handle vectorized operations: we could add a 4 element vector d to a one element vector 10 and get a sensible result.

Side note for programmers: if you’re coming to R with a background in other programming languages, you might have caught something in the last code block that freaked you out. We had previously assigned the variable c to hold a number; yet we were still able to call the built-in c function from this scope. In almost any other common scripting language this would not have worked. Although base scoping rules in R are strictly lexical, there are a number of aggressive additional checks that are performed on look-up failures. For example, when an attempt is made to apply a non-function in an inner scope, the interpreter will ascend the scope chain looking for up-values that are functions. This is why we were able to bind c to a vector of numbers in our current scope and still use the base R c() function. Another example of this design philosophy at play in R is aggressive partial matching for named arguments in function calls. Core R is designed with fallbacks galore to give it a good shot at being able to execute whatever mess you hand it.

Strings

So far we’ve just been working with numbers (numeric values in R lingo), but R also supports text (string values) and True/False (boolean values) data.

To create strings, surround your text with either double " ... " or single ' ... ' quotes:

"a"
## [1] "a"
"a" == 'a'
## [1] TRUE
c( "a", "b", "c", "d" )
## [1] "a" "b" "c" "d"

Escape characters

You can create strings containing single quotes using double quotes and visa versa, but if you need to make a string that contains both single and double quotes you need to use the \ “escape” character:

s <- "My data are \"awesome\"!"
cat(s)
## My data are "awesome"!

Two other special string characters are tab \t and newline \n:

s <- "a\tb\tc"
cat(s)
## a    b   c
s <- "a\nb\nc"
cat(s)
## a
## b
## c

Boolean values

To create Boolean values use TRUE or FALSE:

TRUE
## [1] TRUE
FALSE
## [1] FALSE
TRUE == FALSE
## [1] FALSE

Missing Data

R has built-in support for flagging values as missing data. The special NA value can be mixed with any other kind of data in a vector. For example:

c(1, 2, NA, 4)
## [1]  1  2 NA  4
c( "a", NA, "c", NA )
## [1] "a" NA  "c" NA
c(TRUE, FALSE, NA, FALSE)
## [1]  TRUE FALSE    NA FALSE
is.na( c(1, NA) )
## [1] FALSE  TRUE

Most R functions will either understand how to deal with missing data, or issue an error if they involve a type of statistical analysis that can’t be used with missing data.

A note about NULL

R also has a special “no-value” type called NULL. If you are coming to R from another programming language it is easy confuse NA and NULL (for example, in Python data analysis modules the None type is often used to do double duty, signifying NULL or NA depending on the context). By convention, you should use NA in data structures to represent missing data points.

NULL is used to signify unassigned variables:

NULL
## NULL
is.null(NULL)
## [1] TRUE

Indexing syntax in R

Extracting values

Let’s say we have a vector of numbers:

myNumbers <- c( 10, 20, 30, 40, 50 )

We can extract elements from 1D vectors using the index syntax [] and integers:

myNumbers
## [1] 10 20 30 40 50
myNumbers[1]
## [1] 10
myNumbers[3]
## [1] 30

Here, we’ve extracted elements at the position given by the integer we put inside of the [...]. Remember that the numbers 1 and 3 in the code above are actually vectors of integers.

We can use integer vectors with more than one element inside of our index [...]’s::

myNumbers[ c(1, 3) ]
## [1] 10 30

You can use the : operator to easily create a sequence of numbers:

2:4
## [1] 2 3 4
myNumbers[2:4]
## [1] 20 30 40

In addition to putting integer vectors inside of the index [...] we can also use logical vectors. If we do, TRUE at a position causes a value to be extracted, while a FALSE indicates that it should be skipped. Let’s look at an example:

myNumbers
## [1] 10 20 30 40 50
myNumbers[ c(FALSE, TRUE,  TRUE,  TRUE,  TRUE ) ]
## [1] 20 30 40 50
myNumbers[ c(TRUE,  FALSE, FALSE, FALSE, FALSE) ]
## [1] 10

So why would you ever want to do this? The answer lies in the combination of indexing and the logical operators (>, <, ==, !=, and %in%).

Logical operators always return a logical vector:

myNumbers > 25
## [1] FALSE FALSE  TRUE  TRUE  TRUE
myNumbers < 25
## [1]  TRUE  TRUE FALSE FALSE FALSE
myNumbers == 30
## [1] FALSE FALSE  TRUE FALSE FALSE
myNumbers != 30
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

The %in% operator asks if the first set of numbers can be found in the second:

30 %in% myNumbers
## [1] TRUE
c(10, 100) %in% myNumbers
## [1]  TRUE FALSE

The ! operator negates (flips) each value of a logical vector:

!TRUE
## [1] FALSE
!(myNumbers > 25)
## [1]  TRUE  TRUE FALSE FALSE FALSE

A note about = vs ==: Many beginners are confused by the difference between = and ==. The = operator is used for value assignment, traditionally for arguments inside of function calls such as plot(x = 10, y = 1), or in newer versions of R in place of the <- operator as in a = 10. If you want to compare equivalence between two values you’ll want to use the double == operator. These operations will evaluate to a logical vector (TRUE or FALSE).

So how can we combine logical comparisons with indexing?

myNumbers[myNumbers > 25]
## [1] 30 40 50
myNumbers[myNumbers < 25]
## [1] 10 20

You can get fancy…

myNumbers[ (myNumbers %% 2) == 0 ]
## [1] 10 20 30 40 50

What happened there? If you need help figuring it out, look up the %% (modulo) operator on the help panel.

Assigning values

Finally, the indexing [...] syntax isn’t just used to extract values from data structures. It can also be used to assign values into existing structures. For example:

myNumbers
## [1] 10 20 30 40 50
myNumbers[3]    <- 100
myNumbers
## [1]  10  20 100  40  50
myNumbers[2:3]  <- c(1,2)
myNumbers
## [1] 10  1  2 40 50

Bigger data structures

As we saw in the previous section, vectors are the basic building blocks of all data in R and can hold numeric, string or Boolean values. Vectors can in turn be composed into more complicated data structures including matrices, arrays, data frames and lists.

Matrix

A matrix is a vector of vectors, each the same length and with the same type of data:

m <- matrix(1:8, nrow = 2, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

You access values on a matrix by using a one element index, referring to a n’th position:

m[2]
## [1] 2

Alternatively you can specify a [row, col]:

m[1, 2]
## [1] 3

Or just a row:

m[1, ]
## [1] 1 3 5 7

Or just a column:

m[, 2]
## [1] 3 4

If you forget this syntax, just pay attention to how R prints out matrices!

Array

An array is a matrix of more than two dimensions. It’s unlikely that you’ll need to work with arrays this term, but it’s good to know what you’re looking at if you see one in the wild:

array(1:8, dim = c(2, 2, 2))
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

Lists

Unlike arrays and matrices, lists are collections of vectors where the individual vectors can hold different types of data and be of different lengths:

l <- list( a = c(1, 2, 3, 4), b = c("a", "b", "c") )
l
## $a
## [1] 1 2 3 4
## 
## $b
## [1] "a" "b" "c"

You can access individual vectors on lists using indexing with numbers or names:

l[1]
## $a
## [1] 1 2 3 4
l["a"]
## $a
## [1] 1 2 3 4

Did you notice what type of thing was returned there?

To simplify the result of indexing down to a vector (rather than a one element list):

l[[1]]
## [1] 1 2 3 4
l[["a"]]
## [1] 1 2 3 4

The second example above is a common task in R: referring to a named elements on a list to get the value (instead of a list with that value as the only element). Because of this there’s a short hand, the $:

l$a
## [1] 1 2 3 4
l$a == l[["a"]]
## [1] TRUE TRUE TRUE TRUE

Make sure you understand why this fails:

l$a == l["a"]

After class

Finish up the parts of this document we didn’t get to in class today. If you are a beginner and aren’t feeling confident with the basics I’d recommend doing the first 3 excercises on the tryR site. Don’t worry if you’re still feeling a bit confused; come to class on Monday ready to ask questions.

If you’re a computer science guru and were bored to tears with this introduction, I’d recommend reading Evaluating the Design of the R Language. It’s an excellent overview of R language semantics and the standard library which will provides some insights into differences you’re likely to run into between evaluation semantics and idiomatic constructs in R and other languages you’ve seen.