View as a slideshow.

Joining Data Sets

So far we’ve limited our analyses to variables contained within a given table. In real projects, however, you frequently need to join information contained in multiple source tables. Let’s look at the tools dplyr provides to make this easy. As always a picture is worth a thousand words and the Data Transformation cheat sheet is a great reference (see the Combine Tables section).

Going back to the GISS Surface Temperature data we looked at in the last class, you’ll see that individual tables are available for Norther and Southern hemisphere observations.

Use what we learned last class to download these two files (Shell -> wget):

For my code below, I’ve saved them to a data subfolder of my current working directory and named them northern-mean.csv and southern-mean.csv, respectively.

Take a look at the files in the R Studio editor and clean up as necessary. You’ll notice that these are slightly less aweful than the file I pointed you to last class; mea culpa, I wanted to show you how bad it could get!

For my code below, clean versions of these files are named northern-mean-clean.csv and southern-mean-clean.csv, respectively.

These files are in comma-separated format. You could use read.table to load them and set the separator to a "," or use the shortcut function named read.csv:

northern_raw <- read.csv("data/northern-mean-clean.csv", na.strings = "***")
southern_raw <- read.csv("data/southern-mean-clean.csv", na.strings = "***")

As we did in the last class (1) we want to tidy up these tables, (2) clean up column names and (3) make a numeric month column. Since we want to do all of those things to both tables, let’s make a little function!

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
tidyTemps <- function(table) {
  # gather
  tidy <- gather(table[1:13], key = "month", value = "index", Jan:Dec)
  
  # lower case all variable names
  colnames(tidy) <- tolower(colnames(tidy))
  
  # make numeric month column
  month_n <- 1:12
  names(month_n) <- colnames(table[2:13])
  tidy$month_n <- month_n[tidy$month]
  
  # Functions return their last value; return the tidy table
  tidy
}

Now we can use our function to tidy and clean up each table:

northern <- tidyTemps(northern_raw)
southern <- tidyTemps(southern_raw)

Let’s say we’d like to look at a correlation between temperatures in the two hemispheres. To do that we need to get all of the data into the same table. We can do that with one of the join functions from dplyr. In this case, our tables both have the same number of rows, so it really doesn’t matter which we choose. I’ll use left_join here.

temps <- left_join(northern, southern, by = c("year", "month_n"))

Take a look at the result and make sure you understood what happened there. Let’s do some clean up:

temps$month.x <- NULL
temps$month.y <- NULL
colnames(temps)[2] <- "northern"
colnames(temps)[4] <- "southern"

Explore

Is this table Tidy? Well, yes and no! If we want to look at a correlation between northern and southern hemisphere temperatures, it is. Make that plot now!

What if we want to look at plots of Northern and Southern temperatures as a function of time? Do the tidying that you would need to and make that plot!

Then try:

  • Play with geom_smooth to add a kernal trend line (which is this a nice approach given these data)
  • If you’re feeling ambition try fitting a linear model to (some) of the data using lm
  • Make it interactive! Allow zooming on the x-axis.