Data

So far we’ve been working with data sets that are already tidy; we’ve loaded them from packages or sensibly formatted files. In the wild, however, it’s rare that you come across data that are ready to go like this. On data analysis projects, you usually have to spend a significant portion of your time wrangling your data into a usable state.

For today, let’s take a look at a data set that tracks changes in global land and ocean temperatures, hosted at the Goddard Institute for Space Studies. I took a guess that NASA would be a good place to find a messy data file to work with, and they didn’t disappoint!

The source file we’ll work with today lives here on the internet:

https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt

Open it in a web browser and have a look at it in all its messy glory. Make sure you understand what the numbers mean.

Normally, we’d have three options for fetching data from a file on the internet:

  • As we did before, using the “Import Dataset” wizard on the Environment tab and pasting the URL into the box. However, no amount of fiddling is ever going to give us reasonable results with this file. Why?
  • Download the file to your computer and then upload it to your RStudio account using the “Upload” button on the “Files” tab.
  • Skip the middle man and download the file directly. You can do this using the “Shell” found in the “Tools” menu.

I like to keep all of my source data files separate from my script and output files. If you don’t have one already, make a folder named ‘data’ in your current working directory (“Files” tab -> “New Folder”). You don’t have to do it this way, but if you don’t, you’ll need to change the file paths below.

To download the file to your current directory in RStudio, open a Shell (this is a full bash shell), and run:

wget -O data/global-mean.txt https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt
## --2017-09-21 10:47:04--  https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt
## Resolving data.giss.nasa.gov (data.giss.nasa.gov)... 129.164.128.233, 2001:4d0:2310:230::233
## Connecting to data.giss.nasa.gov (data.giss.nasa.gov)|129.164.128.233|:443... connected.
## HTTP request sent, awaiting response... 200 OK
## Length: 15988 (16K) [text/plain]
## Saving to: ‘data/global-mean.txt’
## 
##      0K .......... .....                                      100%  586K=0.03s
## 
## 2017-09-21 10:47:04 (586 KB/s) - ‘data/global-mean.txt’ saved [15988/15988]

Note the capital “-O”, which specifies our output file.
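If you’d rather stay inside R, base R’s download.file can do the same job as wget. Here is a minimal sketch; the helper name fetch_raw is made up for illustration, and it assumes you want the target folder created when it’s missing:

```r
# Hypothetical helper (not part of the original notes): download a file
# into a folder, creating the folder first if it doesn't exist yet.
fetch_raw <- function(url, dest) {
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  download.file(url, destfile = dest, quiet = TRUE)
  invisible(dest)
}

# Usage, with the same URL and output path as the wget call above:
# fetch_raw("https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt",
#           "data/global-mean.txt")
```

The usage line is commented out so that nothing gets downloaded until you decide to run it.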

Fiddle all you like with the data import wizard; you’re not going to get this file to parse correctly!

Cleaning up the input file

Take a moment to count all of the ways this file is messed up. Next week we’ll explore some text processing tools that would allow us to take a more automated approach to cleaning up input files (useful if you have many!), but for now we’ll hand-edit the file, recording all of the corrections we make.

Open up the text file in RStudio by navigating to your data directory on the “Files” tab and clicking on it.

Hand edit it to:

  • Get rid of all of the header lines (1-7)
  • Get rid of all of the blank lines and repeats of column names (e.g. 23-24)
  • Get rid of all of the footer lines (last 7)
  • Leave a single blank line at the end of the file (it should be line 139: one row of column names, 137 years of data, then the blank line).
  • Save it AS A NEW FILE: “global-mean-clean.txt”

We’ll do the rest of the clean up in R.
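If you’re curious what a scripted version of those hand edits might look like, here is a rough sketch in base R. The function name clean_lines is made up, and it rests on two assumptions about the file layout that you should check against the real thing: data lines start with a four-digit year, and column-name lines start with “Year”. The toy input below stands in for the real file.

```r
# Hypothetical helper: keep the first column-name line and every data line,
# dropping headers, footers, blank lines and repeated column names.
clean_lines <- function(lines) {
  is_colnames <- grepl("^Year", lines)                  # assumed column-name lines
  is_data     <- grepl("^[0-9]{4}[[:space:]]", lines)   # assumed data lines
  c(lines[is_colnames][1], lines[is_data])
}

# Toy stand-in for the raw file (the header/footer text is invented).
raw <- c("some header text",
         "",
         "Year   Jan  Feb",
         "1880   -30  -21",
         "",
         "Year   Jan  Feb",
         "1881   -10  -14",
         "some footer text")

cleaned <- clean_lines(raw)
cleaned

# On the real file this would be something like:
# writeLines(clean_lines(readLines("data/global-mean.txt")),
#            "data/global-mean-clean.txt")
```

Whichever way you do it, the point is the same: keep a record of every correction so someone else can reproduce your clean file.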

Reading the table

We can now use R’s read.table function to load the file into a table in R:

raw_temps <- read.table("data/global-mean-clean.txt", 
                        header     = TRUE,
                        na.strings = "****"
                        )

As always, let’s sanity check what kinds of vectors we have in each of our columns:

summary(raw_temps)
##       Year           Jan                Feb               Mar         
##  Min.   :1880   Min.   :-70.0000   Min.   :-61.000   Min.   :-62.000  
##  1st Qu.:1914   1st Qu.:-28.0000   1st Qu.:-24.000   1st Qu.:-24.000  
##  Median :1948   Median : -4.0000   Median : -6.000   Median : -1.000  
##  Mean   :1948   Mean   :  0.6934   Mean   :  2.168   Mean   :  3.781  
##  3rd Qu.:1982   3rd Qu.: 27.0000   3rd Qu.: 30.000   3rd Qu.: 26.000  
##  Max.   :2016   Max.   :117.0000   Max.   :135.000   Max.   :130.000  
##                                                                       
##       Apr               May               Jun                Jul         
##  Min.   :-59.000   Min.   :-54.000   Min.   :-52.0000   Min.   :-48.000  
##  1st Qu.:-26.000   1st Qu.:-25.000   1st Qu.:-25.0000   1st Qu.:-20.000  
##  Median : -5.000   Median : -6.000   Median : -7.0000   Median : -5.000  
##  Mean   :  1.715   Mean   :  1.248   Mean   : -0.4161   Mean   :  2.715  
##  3rd Qu.: 25.000   3rd Qu.: 26.000   3rd Qu.: 16.0000   3rd Qu.: 15.000  
##  Max.   :109.000   Max.   : 93.000   Max.   : 78.0000   Max.   : 83.000  
##                                                                          
##       Aug               Sep               Oct               Nov        
##  Min.   :-51.000   Min.   :-47.000   Min.   :-55.000   Min.   :-56.00  
##  1st Qu.:-20.000   1st Qu.:-17.000   1st Qu.:-19.000   1st Qu.:-19.00  
##  Median : -4.000   Median : -3.000   Median : -1.000   Median : -2.00  
##  Mean   :  2.978   Mean   :  4.701   Mean   :  5.328   Mean   :  3.81  
##  3rd Qu.: 19.000   3rd Qu.: 20.000   3rd Qu.: 20.000   3rd Qu.: 15.00  
##  Max.   : 98.000   Max.   : 90.000   Max.   :106.000   Max.   :104.00  
##                                                                        
##       Dec                J.D               D.N           DJF     
##  Min.   :-78.0000   Min.   :-47.000   -9     :  7   -42    :  6  
##  1st Qu.:-25.0000   1st Qu.:-21.000   -22    :  5   -16    :  5  
##  Median : -8.0000   Median : -7.000   -10    :  4   40     :  4  
##  Mean   :  0.5329   Mean   :  2.438   -2     :  4   -10    :  3  
##  3rd Qu.: 22.0000   3rd Qu.: 19.000   -25    :  4   -17    :  3  
##  Max.   :111.0000   Max.   : 99.000   13     :  3   -18    :  3  
##                                       (Other):110   (Other):113  
##       MAM               JJA               SON              Year.1    
##  Min.   :-56.000   Min.   :-47.000   Min.   :-47.000   Min.   :1880  
##  1st Qu.:-25.000   1st Qu.:-21.000   1st Qu.:-18.000   1st Qu.:1914  
##  Median : -6.000   Median : -6.000   Median : -3.000   Median :1948  
##  Mean   :  2.255   Mean   :  1.737   Mean   :  4.628   Mean   :1948  
##  3rd Qu.: 27.000   3rd Qu.: 16.000   3rd Qu.: 17.000   3rd Qu.:1982  
##  Max.   :111.000   Max.   : 85.000   Max.   : 97.000   Max.   :2016  
## 

We can see D.N and DJF are messed up; why?

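If we needed them, we would have to convert *** and **** to NAs and then turn the columns into numeric vectors. Note that read.table has stored them as factors here (that’s why summary shows counts rather than quartiles), so as.numeric alone would return the factor’s level codes; you’d go through as.character first. But we’re going to erase them instead.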
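In case you ever do need columns like these, here is a small sketch of that conversion. The vector x is a made-up stand-in for a column such as D.N that was read in as a factor:

```r
# Toy stand-in for a column like D.N (values invented for illustration).
x <- factor(c("-9", "***", "13", "-22"))

class(x)       # a quick way to check what kind of vector you have
as.numeric(x)  # careful: on a factor this gives level codes, not the numbers

x_chr <- as.character(x)                  # back to the original strings
x_chr[x_chr %in% c("***", "****")] <- NA  # mark the placeholders as missing
x_num <- as.numeric(x_chr)                # now a proper numeric vector
x_num
```

Another option is to catch the placeholders at import time: the na.strings argument of read.table accepts a character vector, so passing c("***", "****") should mark both as NA while reading.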

Tidy Data

Let’s load the tidyverse:

library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

The tidyverse’s definition of Tidy Data is a table of values where:

  • Each row represents one and only one observation
  • Each column represents one and only one variable
  • ALL variables in the design get one column

If all of these things are true, your path to exploring that data set will be much easier than if they aren’t! It also means all of the tidyverse functions will “just work” with your table.

Now let’s take a moment to appreciate all of the ways in which this data set is not Tidy. It:

  • Uses column names to store values of a variable (month of observation)
  • Uses columns differently: some are summaries of others
  • Inexplicably, has two Year columns

Removing unwanted columns

We have a few options for getting rid of the columns we don’t want (remember, assigning NULL to a column will erase it). But since we know we just want to keep the first 13 columns (year and months), we can easily index:

raw_temps[1:13]
##     Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1   1880 -30 -21 -18 -27 -14 -29 -24  -8 -17 -16 -19 -22
## 2   1881 -10 -14   1  -3  -4 -28  -7  -3  -9 -20 -26 -16
## 3   1882   9   8   1 -20 -18 -25 -11   3  -1 -23 -21 -25
## 4   1883 -34 -42 -18 -25 -26 -13  -9 -14 -19 -12 -21 -19
## 5   1884 -18 -13 -36 -36 -32 -38 -35 -27 -24 -22 -30 -30
## 6   1885 -66 -30 -24 -45 -42 -50 -29 -27 -19 -20 -22  -7
## 7   1886 -43 -46 -41 -29 -27 -39 -16 -31 -19 -25 -26 -25
## 8   1887 -66 -48 -32 -37 -33 -21 -19 -28 -19 -32 -25 -38
## 9   1888 -43 -43 -47 -28 -22 -20 -10 -11  -7   1   0 -12
## 10  1889 -21  14   4   4  -3 -12  -5 -18 -18 -22 -32 -31
## 11  1890 -48 -48 -41 -38 -48 -27 -30 -36 -36 -23 -37 -30
## 12  1891 -46 -49 -15 -25 -17 -22 -22 -21 -13 -24 -37  -3
## 13  1892 -26 -15 -36 -35 -25 -20 -28 -20 -25 -17 -49 -29
## 14  1893 -69 -51 -24 -32 -35 -24 -14 -24 -18 -16 -17 -38
## 15  1894 -55 -31 -20 -41 -30 -43 -32 -29 -23 -17 -25 -22
## 16  1895 -44 -42 -30 -23 -23 -25 -16 -16  -2 -11 -15 -12
## 17  1896 -23 -15 -29 -33 -19 -13  -6  -9  -5   4 -16 -12
## 18  1897 -22 -19 -12  -1   0 -12  -4  -3  -4 -10 -18 -26
## 19  1898  -6 -34 -55 -33 -35 -20 -22 -22 -19 -32 -35 -22
## 20  1899 -18 -39 -35 -21 -20 -26 -13  -4   0   0  12 -27
## 21  1900 -40  -8   2 -14  -6 -15  -9  -4   1   8 -13 -14
## 22  1901 -30  -5   5  -6 -18 -10  -9 -13 -17 -29 -17 -30
## 23  1902 -19  -3 -29 -27 -31 -34 -26 -28 -20 -27 -36 -46
## 24  1903 -27  -6 -23 -39 -41 -44 -30 -44 -43 -42 -38 -47
## 25  1904 -64 -55 -46 -50 -50 -49 -48 -43 -47 -35 -16 -29
## 26  1905 -38 -59 -25 -36 -33 -31 -25 -21 -15 -23  -8 -21
## 27  1906 -31 -34 -15  -2 -21 -22 -27 -19 -25 -20 -38 -18
## 28  1907 -44 -53 -25 -40 -46 -43 -35 -37 -32 -24 -51 -50
## 29  1908 -46 -36 -58 -46 -40 -39 -35 -45 -33 -43 -51 -50
## 30  1909 -70 -47 -52 -59 -54 -52 -43 -30 -37 -39 -31 -55
## 31  1910 -44 -43 -47 -39 -34 -36 -31 -34 -37 -39 -56 -69
## 32  1911 -64 -60 -62 -55 -51 -47 -41 -43 -38 -26 -20 -25
## 33  1912 -27 -13 -37 -20 -20 -26 -41 -51 -47 -55 -38 -42
## 34  1913 -41 -44 -44 -36 -45 -46 -34 -32 -32 -34 -18  -4
## 35  1914   2 -13 -23 -28 -19 -22 -24 -15 -13  -5 -20 -10
## 36  1915 -20  -1  -8   7  -1 -16  -3 -15 -12 -22 -12 -25
## 37  1916 -20 -23 -31 -25 -27 -44 -34 -27 -29 -28 -42 -78
## 38  1917 -46 -53 -47 -38 -48 -40 -23 -26 -18 -35 -29 -71
## 39  1918 -44 -33 -21 -40 -37 -28 -22 -26 -14  -3 -16 -30
## 40  1919 -21 -19 -25 -17 -20 -28 -21 -19 -17 -16 -29 -35
## 41  1920 -15 -22  -8 -26 -26 -33 -32 -29 -20 -29 -33 -47
## 42  1921  -4 -21 -28 -36 -36 -31 -16 -24 -16  -6 -16 -18
## 43  1922 -34 -44 -13 -22 -34 -32 -27 -31 -29 -33 -17 -17
## 44  1923 -27 -37 -32 -38 -33 -24 -29 -30 -28 -13   3  -6
## 45  1924 -24 -27 -12 -35 -19 -28 -27 -35 -30 -36 -23 -43
## 46  1925 -34 -35 -24 -25 -30 -34 -30 -19 -13 -17   3  11
## 47  1926  20   7  12 -15 -25 -25 -21 -11 -11 -11  -6 -30
## 48  1927 -28 -21 -39 -31 -25 -27 -15 -19  -6  -1  -4 -36
## 49  1928  -4 -12 -28 -29 -30 -41 -21 -25 -20 -19  -9 -20
## 50  1929 -47 -61 -34 -40 -39 -43 -33 -29 -23 -15 -14 -55
## 51  1930 -29 -24  -8 -26 -25 -19 -17 -11 -11  -8  14  -9
## 52  1931 -10 -22  -6 -21 -22  -6   1   0  -6   0 -12 -10
## 53  1932  13 -18 -20  -7 -22 -30 -24 -24 -11 -10 -26 -22
## 54  1933 -34 -32 -29 -23 -25 -32 -20 -23 -26 -24 -31 -47
## 55  1934 -27  -4 -31 -27 -11 -14 -11 -10 -16 -11  -1  -9
## 56  1935 -37  11 -13 -35 -26 -23 -19 -17 -17  -8 -29 -22
## 57  1936 -29 -39 -23 -20 -17 -19  -6 -12  -6  -4  -5  -4
## 58  1937 -11   5 -17 -17  -7  -8  -5   3  14  10   9 -12
## 59  1938   0  -4   5   5  -7 -17  -9  -4   3  11   1 -26
## 60  1939 -13 -12 -20 -12  -7  -8  -6  -5   0  -3   6  40
## 61  1940 -15   6  12  16   5   5  10   1  12   7  13  19
## 62  1941  13  23   6  11  10   4  15  14   2  24  12  14
## 63  1942  26   5  13  14  14  11   2  -3   0   6  13  12
## 64  1943  -1  22   1  13  10  -1  14   3  11  30  25  28
## 65  1944  41  31  34  27  26  22  23  23  31  27  12   5
## 66  1945  13   2  11  24  10   2   7  25  22  22  10 -10
## 67  1946  15   6   0  11  -4 -17  -9  -8  -2  -6  -2 -29
## 68  1947 -13  -8   5   4  -6   0  -6  -8 -14   6  -1 -18
## 69  1948   5 -13 -23  -9   8  -5 -13 -10 -10  -7  -8 -23
## 70  1949   9 -16  -1  -7  -9 -22 -13  -8  -8  -3  -8 -19
## 71  1950 -30 -26  -6 -21 -12  -6  -9 -18 -10 -20 -35 -20
## 72  1951 -35 -44 -19 -10  -2  -5   0   5   7   6   0  15
## 73  1952  16  12 -10   2  -5  -4   5   7   8  -4 -17  -2
## 74  1953   9  16  11  20   8   8   2   8   6   5  -5   3
## 75  1954 -28 -10 -12 -18 -20 -16 -16 -13  -7  -1   8 -18
## 76  1955  11 -21 -36 -23 -20  -8  -9   4 -13  -5 -28 -32
##  [ reached getOption("max.print") -- omitted 61 rows ]

Alternatively, we could use the tidyverse select function (useful inside of a pipe chain):

raw_temps %>%
  select(1:13)
##     Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1   1880 -30 -21 -18 -27 -14 -29 -24  -8 -17 -16 -19 -22
## 2   1881 -10 -14   1  -3  -4 -28  -7  -3  -9 -20 -26 -16
## 3   1882   9   8   1 -20 -18 -25 -11   3  -1 -23 -21 -25
## 4   1883 -34 -42 -18 -25 -26 -13  -9 -14 -19 -12 -21 -19
## 5   1884 -18 -13 -36 -36 -32 -38 -35 -27 -24 -22 -30 -30
## 6   1885 -66 -30 -24 -45 -42 -50 -29 -27 -19 -20 -22  -7
## 7   1886 -43 -46 -41 -29 -27 -39 -16 -31 -19 -25 -26 -25
## 8   1887 -66 -48 -32 -37 -33 -21 -19 -28 -19 -32 -25 -38
## 9   1888 -43 -43 -47 -28 -22 -20 -10 -11  -7   1   0 -12
## 10  1889 -21  14   4   4  -3 -12  -5 -18 -18 -22 -32 -31
## 11  1890 -48 -48 -41 -38 -48 -27 -30 -36 -36 -23 -37 -30
## 12  1891 -46 -49 -15 -25 -17 -22 -22 -21 -13 -24 -37  -3
## 13  1892 -26 -15 -36 -35 -25 -20 -28 -20 -25 -17 -49 -29
## 14  1893 -69 -51 -24 -32 -35 -24 -14 -24 -18 -16 -17 -38
## 15  1894 -55 -31 -20 -41 -30 -43 -32 -29 -23 -17 -25 -22
## 16  1895 -44 -42 -30 -23 -23 -25 -16 -16  -2 -11 -15 -12
## 17  1896 -23 -15 -29 -33 -19 -13  -6  -9  -5   4 -16 -12
## 18  1897 -22 -19 -12  -1   0 -12  -4  -3  -4 -10 -18 -26
## 19  1898  -6 -34 -55 -33 -35 -20 -22 -22 -19 -32 -35 -22
## 20  1899 -18 -39 -35 -21 -20 -26 -13  -4   0   0  12 -27
## 21  1900 -40  -8   2 -14  -6 -15  -9  -4   1   8 -13 -14
## 22  1901 -30  -5   5  -6 -18 -10  -9 -13 -17 -29 -17 -30
## 23  1902 -19  -3 -29 -27 -31 -34 -26 -28 -20 -27 -36 -46
## 24  1903 -27  -6 -23 -39 -41 -44 -30 -44 -43 -42 -38 -47
## 25  1904 -64 -55 -46 -50 -50 -49 -48 -43 -47 -35 -16 -29
## 26  1905 -38 -59 -25 -36 -33 -31 -25 -21 -15 -23  -8 -21
## 27  1906 -31 -34 -15  -2 -21 -22 -27 -19 -25 -20 -38 -18
## 28  1907 -44 -53 -25 -40 -46 -43 -35 -37 -32 -24 -51 -50
## 29  1908 -46 -36 -58 -46 -40 -39 -35 -45 -33 -43 -51 -50
## 30  1909 -70 -47 -52 -59 -54 -52 -43 -30 -37 -39 -31 -55
## 31  1910 -44 -43 -47 -39 -34 -36 -31 -34 -37 -39 -56 -69
## 32  1911 -64 -60 -62 -55 -51 -47 -41 -43 -38 -26 -20 -25
## 33  1912 -27 -13 -37 -20 -20 -26 -41 -51 -47 -55 -38 -42
## 34  1913 -41 -44 -44 -36 -45 -46 -34 -32 -32 -34 -18  -4
## 35  1914   2 -13 -23 -28 -19 -22 -24 -15 -13  -5 -20 -10
## 36  1915 -20  -1  -8   7  -1 -16  -3 -15 -12 -22 -12 -25
## 37  1916 -20 -23 -31 -25 -27 -44 -34 -27 -29 -28 -42 -78
## 38  1917 -46 -53 -47 -38 -48 -40 -23 -26 -18 -35 -29 -71
## 39  1918 -44 -33 -21 -40 -37 -28 -22 -26 -14  -3 -16 -30
## 40  1919 -21 -19 -25 -17 -20 -28 -21 -19 -17 -16 -29 -35
## 41  1920 -15 -22  -8 -26 -26 -33 -32 -29 -20 -29 -33 -47
## 42  1921  -4 -21 -28 -36 -36 -31 -16 -24 -16  -6 -16 -18
## 43  1922 -34 -44 -13 -22 -34 -32 -27 -31 -29 -33 -17 -17
## 44  1923 -27 -37 -32 -38 -33 -24 -29 -30 -28 -13   3  -6
## 45  1924 -24 -27 -12 -35 -19 -28 -27 -35 -30 -36 -23 -43
## 46  1925 -34 -35 -24 -25 -30 -34 -30 -19 -13 -17   3  11
## 47  1926  20   7  12 -15 -25 -25 -21 -11 -11 -11  -6 -30
## 48  1927 -28 -21 -39 -31 -25 -27 -15 -19  -6  -1  -4 -36
## 49  1928  -4 -12 -28 -29 -30 -41 -21 -25 -20 -19  -9 -20
## 50  1929 -47 -61 -34 -40 -39 -43 -33 -29 -23 -15 -14 -55
## 51  1930 -29 -24  -8 -26 -25 -19 -17 -11 -11  -8  14  -9
## 52  1931 -10 -22  -6 -21 -22  -6   1   0  -6   0 -12 -10
## 53  1932  13 -18 -20  -7 -22 -30 -24 -24 -11 -10 -26 -22
## 54  1933 -34 -32 -29 -23 -25 -32 -20 -23 -26 -24 -31 -47
## 55  1934 -27  -4 -31 -27 -11 -14 -11 -10 -16 -11  -1  -9
## 56  1935 -37  11 -13 -35 -26 -23 -19 -17 -17  -8 -29 -22
## 57  1936 -29 -39 -23 -20 -17 -19  -6 -12  -6  -4  -5  -4
## 58  1937 -11   5 -17 -17  -7  -8  -5   3  14  10   9 -12
## 59  1938   0  -4   5   5  -7 -17  -9  -4   3  11   1 -26
## 60  1939 -13 -12 -20 -12  -7  -8  -6  -5   0  -3   6  40
## 61  1940 -15   6  12  16   5   5  10   1  12   7  13  19
## 62  1941  13  23   6  11  10   4  15  14   2  24  12  14
## 63  1942  26   5  13  14  14  11   2  -3   0   6  13  12
## 64  1943  -1  22   1  13  10  -1  14   3  11  30  25  28
## 65  1944  41  31  34  27  26  22  23  23  31  27  12   5
## 66  1945  13   2  11  24  10   2   7  25  22  22  10 -10
## 67  1946  15   6   0  11  -4 -17  -9  -8  -2  -6  -2 -29
## 68  1947 -13  -8   5   4  -6   0  -6  -8 -14   6  -1 -18
## 69  1948   5 -13 -23  -9   8  -5 -13 -10 -10  -7  -8 -23
## 70  1949   9 -16  -1  -7  -9 -22 -13  -8  -8  -3  -8 -19
## 71  1950 -30 -26  -6 -21 -12  -6  -9 -18 -10 -20 -35 -20
## 72  1951 -35 -44 -19 -10  -2  -5   0   5   7   6   0  15
## 73  1952  16  12 -10   2  -5  -4   5   7   8  -4 -17  -2
## 74  1953   9  16  11  20   8   8   2   8   6   5  -5   3
## 75  1954 -28 -10 -12 -18 -20 -16 -16 -13  -7  -1   8 -18
## 76  1955  11 -21 -36 -23 -20  -8  -9   4 -13  -5 -28 -32
##  [ reached getOption("max.print") -- omitted 61 rows ]
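For comparison, here is the NULL approach mentioned above, sketched on a toy data frame (the values are made up). Unlike indexing or select, it changes the table in place, one column at a time:

```r
# Toy table with two columns we don't want (values invented).
toy <- data.frame(Year   = c(1880, 1881),
                  Jan    = c(-30, -10),
                  DJF    = c("***", "-9"),
                  Year.1 = c(1880, 1881))

toy$DJF    <- NULL   # erase the seasonal summary column
toy$Year.1 <- NULL   # erase the duplicate year column

names(toy)
```

With only a couple of unwanted columns this is fine; with eight of them, indexing the 13 we want to keep is less typing.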

Tidy-ifying the columns

The data set still isn’t tidy: Month should be a variable, but it’s spread across columns. In tidyverse lingo, we need to gather them up!

I’d like to take a moment here to point out a REALLY useful resource to use when you’re trying to figure out which tidyverse function you need to help you wrangle data: the Data Wrangling Cheatsheet.

You can find this file in RStudio: Help -> Cheatsheets -> Data Manipulation with dplyr and tidyr

The picture that matches what we need to do tells us it’s a gather:

global_temps <- gather(raw_temps[1:13], key = "month", value = "index", Jan:Dec)

What happened there? Take a look at the table.
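It can help to watch gather work on a table small enough to read at a glance. The sketch below uses two years and two months copied from the top of our table. It also shows pivot_longer, which newer versions of tidyr offer for the same job; that part assumes you have tidyr 1.0.0 or later installed.

```r
library(tidyr)

# Two years and two months from the top of the table above.
toy <- data.frame(Year = c(1880, 1881),
                  Jan  = c(-30, -10),
                  Feb  = c(-21, -14))

# Year stays a column; the Jan and Feb columns are stacked one after the
# other, so all of the Jan rows come before all of the Feb rows.
long <- gather(toy, key = "month", value = "index", Jan:Feb)
long

# Assumption: tidyr >= 1.0.0. Same reshaping, but the rows come out
# year by year (Jan, Feb for 1880, then Jan, Feb for 1881).
long2 <- pivot_longer(toy, Jan:Feb, names_to = "month", values_to = "index")
long2
```

Keep the row order in mind: it matters if you ever build a new column by position rather than by looking values up.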

Last, but not least, it should bother you that the “Year” column is capitalized and the two others aren’t. I’d recommend always keeping your variable names in lower case for consistency.

We can use colnames to update the names of our columns and tolower to switch strings in character vectors to all lower case:

colnames(global_temps) <- tolower(colnames(global_temps))

Creating dates

The last problem we have to solve is to create dates from our year (numeric) and month (character) vectors. To do that, we’d first like to convert months from arbitrary strings into numbers.

Using a named vector

In operations like this, where you want to translate one set of values one-to-one into another set of values, one robust solution is to create a vector of values you want to translate to and then add names to the vector containing the values you want to translate from. Confused? It’s easier to understand with an example!

Let’s make a vector holding month numbers:

month_n <- 1:12

Remember what this does:

month_n[3]
## [1] 3

We can set the “names” of the numbers in this vector with the names function. I could type out my month strings by hand, but I’ll be lazy and use the fact that they’re already in columns in our original table:

colnames(raw_temps[2:13])
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"

So we can do this:

names(month_n) <- colnames(raw_temps[2:13])

Once you have a named vector, you can index elements by name in addition to using numeric indexes. For example:

month_n["Apr"]
## Apr 
##   4
month_n[c("Apr", "Aug")]
## Apr Aug 
##   4   8
month_n[c("Apr", "Aug", "Apr")]
## Apr Aug Apr 
##   4   8   4

Making our month number column then becomes as easy as indexing on our months column:

global_temps$month_n <- month_n[global_temps$month]
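Here is the same lookup as a self-contained sketch on a short vector of months, so you can see exactly what comes back. It uses month.abb, a constant built into R that holds the twelve three-letter English month abbreviations, in place of our column names; the months vector is a toy example.

```r
# Build the lookup vector: numbers 1 to 12, named "Jan" to "Dec".
month_n <- setNames(1:12, month.abb)

# Toy month column, in no particular order.
months <- c("Jan", "Jan", "Feb", "Dec")

# Indexing by name keeps the names; unname() drops them if you just
# want plain numbers in your column.
by_name <- unname(month_n[months])
by_name

# For this particular translation, match() gives the same numbers by
# finding each month's position in month.abb.
by_match <- match(months, month.abb)
by_match

# Caveat: if the month column were a factor, indexing with it would use
# the factor's integer codes, so wrap it in as.character() first.
```

The named-vector trick is the more general of the two, since it works for any one-to-one translation, not just ones where the answer happens to be a position.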

Using the rep function

The solution above is robust in that it will work even if there isn’t a regular pattern in the repetition of months down the rows of the table. But our table does have a regular pattern: after the gather, each of the 12 months appears in a block of 137 consecutive rows, one row per year in the data set.

If you need a new column that contains a set of values that regularly repeats, an alternative solution is to use the rep function:

rep(1:12, each = 137)

The each argument says to repeat each value 137 times before moving on to the next one.

Alternatively, we can specify times, which repeats the whole vector 137 times:

rep(1:12, times = 137)
##    [1]  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11
##   [24] 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10
##   [47] 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9
##   [70] 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8
##   [93]  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7
##  [116]  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6
##  [139]  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5
##  [162]  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4
##  [185]  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3
##  [208]  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2
##  [231]  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1
##  [254]  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12
##  [277]  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11
##  [300] 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10
##  [323] 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9
##  [346] 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8
##  [369]  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7
##  [392]  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6
##  [415]  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5
##  [438]  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4
##  [461]  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3
##  [484]  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2
##  [507]  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1
##  [530]  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12
##  [553]  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11
##  [576] 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10
##  [599] 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9
##  [622] 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8
##  [645]  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7
##  [668]  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6
##  [691]  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5
##  [714]  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4
##  [737]  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3
##  [760]  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2
##  [783]  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1
##  [806]  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12
##  [829]  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11
##  [852] 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10
##  [875] 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9
##  [898] 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8
##  [921]  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7
##  [944]  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6
##  [967]  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5
##  [990]  6  7  8  9 10 11 12  1  2  3  4
##  [ reached getOption("max.print") -- omitted 644 entries ]

Note the difference!
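The difference is easier to see on a short vector. A small sketch, with 3 values and 2 repeats standing in for our 12 months and 137 years:

```r
by_each  <- rep(1:3, each = 2)   # every value repeated before moving on
by_times <- rep(1:3, times = 2)  # the whole vector repeated end to end

by_each    # 1 1 2 2 3 3
by_times   # 1 2 3 1 2 3
```

Which one you want depends on how the rows of your table are ordered. Because gather stacked our month columns one after the other (all of the Jan rows, then all of the Feb rows, and so on), it is the each version that lines up with global_temps as we built it.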

Lubridate

Finally, we’ll use a helper function from a new package called lubridate to easily create a date column:

library(lubridate)
global_temps$date <- make_date(year = global_temps$year, month = global_temps$month_n)

We’ll explore Dates and Times in more depth next week.
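When you don’t give make_date a day, it uses the first of the month. If lubridate isn’t installed, you can get roughly the same result in base R by pasting together a date string and handing it to as.Date. A minimal sketch on a few toy year/month pairs:

```r
# Toy inputs: numeric years and month numbers.
year    <- c(1880, 1880, 2016)
month_n <- c(1, 2, 12)

# Build "YYYY-MM-01" strings, then convert them to Date objects.
date_strings <- sprintf("%d-%02d-01", as.integer(year), as.integer(month_n))
dates <- as.Date(date_strings)

dates
class(dates)
```

The lubridate version is shorter and harder to get wrong, which is why we used it above.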

Analysis

Now, on your own:

  • Plot temperature indexes as a function of time
  • Play with geom_smooth to add a kernel trend line (why is this a nice approach given these data?)
  • If you’re feeling ambitious, try fitting a linear model to (some of) the data using lm
  • Make it interactive! Allow zooming on the x-axis.
  • Go back to the source data page and compare northern and southern hemisphere trends.