Start the exercise

To update your project with the data and .rmd file for this exercise, run:

bio297::start()

Mechanical Turk survey

Download the file johnsonlab.xlsx from the data/ subfolder and take a quick look at it in Excel.

Reading from an Excel file

I've already installed readxl for you, but if I hadn't, you could install it with:

install.packages("readxl")

To load the functions in readxl into our current session, we'll use library:

library(readxl)

Packages and help pages

At the console, you can list the functions an attached package provides with:

ls("package:readxl")
## [1] "excel_sheets" "read_excel"
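Once you know a function's name, the help system tells you how to call it. The calls below are a minimal sketch using base R's help tools; they assume readxl is installed and attached as above.

```r
library(readxl)

# Open the help page for a single function (two equivalent forms)
?read_excel
help("read_excel")

# Show the index of everything the package documents
help(package = "readxl")

# Print a function's arguments and their defaults without leaving the console
args(read_excel)
```

Looking at the defaults printed by args is often the quickest way to see what a function will do when you leave an argument out.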

Loading data from Excel

The read_excel function seems simple enough! Let's use it to load data from the first sheet:

jd <- read_excel("data/johnsonlab.xlsx")

Why didn't we have to specify a sheet argument above?

Getting the lay of the land

View(jd)

A few things to notice about this data set:

  • The data are tidy: each variable has a column, and each observation has a row
  • Variable names are formatted consistently
  • Variable names start with letters and have no spaces

These data are also a nice mix of variable types. We have:

  • Continuous data: VacuumTime
  • Discrete data: VacuumUnderstanding
  • Categorical data: Gender

class(jd$VacuumTime)
## [1] "numeric"
class(jd$VacuumUnderstanding)
## [1] "numeric"

Now let's check on a categorical variable:

class(jd$Gender)
## [1] "numeric"
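Checking columns one at a time gets tedious. As a sketch, the base R functions sapply and str report every column's type in one call. The toy data frame below is a hypothetical stand-in for jd with made-up values; only the column names come from the exercise.

```r
# Hypothetical stand-in for jd: a few made-up rows with the same column names
toy <- data.frame(
  VacuumTime          = c(2.1, 4.4, 6.7),
  VacuumUnderstanding = c(3, 5, 4),
  Gender              = c(1, 2, 2)
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
col_classes

# Compact overview: dimensions, column types, and the first few values
str(toy)
```

Replacing toy with jd should give the same kind of overview for the real data.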

The facts about Factors

  • Use character to hold arbitrary text, for example codon sequences.
  • Use factor to hold true categorical variables (Gender) with defined levels (male, female).

gender <- factor( c("male", "female") )
gender
## [1] male   female
## Levels: female male

levels(gender)
## [1] "female" "male"

You can index factors just like any other type of vector:

gender[ c(1, 1, 2, 2, 1) ]
## [1] male   male   female female male  
## Levels: female male

Using factors to model categorical data

We can replace the current numeric column with a factor:

jd$Gender <- gender[ jd$Gender ]

Check to make sure that worked and verify that you understand why it did!

# Enter your code here!
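The indexing approach depends on the order of the names inside gender. As an alternative sketch, factor can take the mapping from codes to labels directly through its levels and labels arguments. The codes vector below is made up, and the coding (1 = male, 2 = female) is an assumption taken from the replacement above, not something read from the spreadsheet.

```r
# Hypothetical codes standing in for the numeric Gender column
# (assumed coding: 1 = male, 2 = female)
codes <- c(1, 1, 2, 2, 1)

# Approach from the text: index a two-element factor by the codes
gender <- factor(c("male", "female"))
by_index <- gender[codes]

# Alternative: state the code-to-label mapping explicitly
by_label <- factor(codes, levels = c(1, 2), labels = c("male", "female"))

# Same values either way; only the order of the levels differs
as.character(by_index)
as.character(by_label)
levels(by_index)
levels(by_label)
```

The explicit version keeps the levels in the order you list them, while the indexing version inherits the alphabetical order that factor chose for gender.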

Summary statistics

You can call summary on an entire data frame:

summary(jd)

Or just one vector:

summary(jd$VacuumTime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.519   3.286   4.384   6.092   6.696  95.030
summary(jd$Gender)
## female   male 
##    118     58
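Notice that summary returned quartiles for the numeric column but counts for the factor. A small sketch with made-up values (not taken from jd) shows the same behaviour in isolation:

```r
# Made-up values for illustration only
times  <- c(1.5, 3.3, 4.4, 6.7, 95.0)
groups <- factor(c("female", "male", "female", "female", "male"))

# Numeric vector: minimum, quartiles, mean, and maximum
num_summary <- summary(times)
num_summary

# Factor: number of observations in each level
cat_summary <- summary(groups)
cat_summary

# A mean well above the median hints at a long right tail or an outlier
mean(times) > median(times)
```

The VacuumTime output above shows the same pattern: the mean sits above the median, and the maximum is far from the third quartile, so a plot is a sensible next step.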

Basic plotting

Like summary, plotting functions such as plot and boxplot are a little bit magical in R: they adapt to the kind of data you give them.

We can plot one numeric column:

plot(jd$VacuumTime)

Or make a scatter plot with two:

plot(jd$VacuumTime, jd$VelcroTime)

Formula syntax

In formula syntax (y ~ x), that last scatter plot would be:

plot(VelcroTime ~ VacuumTime, data = jd)

The same syntax works for boxplot, here with Age grouped by Gender:

boxplot(Age ~ Gender, data = jd)
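In a formula, the variable to the left of ~ goes on the y axis and the variable to the right goes on the x axis (or defines the groups). The sketch below tries this on a hypothetical stand-in for jd; the values are made up and only the column names come from the exercise.

```r
# Hypothetical stand-in for jd with made-up values
toy <- data.frame(
  VacuumTime = c(2.1, 4.4, 6.7, 3.2, 5.0, 8.3),
  VelcroTime = c(3.0, 5.1, 7.9, 2.8, 6.2, 9.5),
  Age        = c(23, 35, 41, 29, 52, 38),
  Gender     = factor(c("male", "female", "female", "male", "female", "male"))
)

# These two calls draw the same points: VacuumTime on x, VelcroTime on y
plot(toy$VacuumTime, toy$VelcroTime)
plot(VelcroTime ~ VacuumTime, data = toy)

# With a factor on the right-hand side, boxplot draws one box per level
bp <- boxplot(Age ~ Gender, data = toy)
bp$names
```

The formula version also gives cleaner axis labels, because it uses the column names rather than the full toy$... expressions.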

Your turn

Use plot and boxplot to explore several other relationships in this data set!

# Enter your code here!

After class

  1. Finish this exercise (fill in all of the "# Enter your code here!" blocks). Check for errors by clicking on "Knit HTML" and looking over the document.
  2. When you're ready, use bio297::submit("03-tidy-data-1.rmd") to submit the assignment.
  3. Read Wickham 2014 (in "Resources", "Literature" on Sakai) for next class.