Class Meeting 2 Introduction to R

Today, we’ll get you up to speed with a minimum “need to know” about using R and RStudio. We’re going to assume you know nothing, but aren’t covering the breadth of the R/RStudio landscape.

The format of today’s notes aim to teach R by exploration, so is essentially an activity guide with prompts for exploration. These are mostly all exercises we’ll be doing together in class.

To participate in today’s lecture, you should have:

Announcements: (10 min)

  • Assignment 1 is launched!
  • Class meeting schedule is fixed

2.1 Learning Objectives

By the end of today’s class, students are expected to be able to:

  • Write an R script to perform simple calculations
  • Access the R documentation on an as-needed basis
  • Use functions and operators in R
  • Subset vectors in R
  • Explore a data frame in R
  • Load packages in R

2.2 Participation

Start a new R script in RStudio, and add your exploratory code to the script as we work through the exercises. What you write on this script doesn’t have to be exactly the same as what I write – we’re just looking for some exploration of coding in R.

2.3 Resources

Here are some useful resources for getting oriented with R.

Today, we’ll be learning just enough base R so that we can dive in to the tidyverse side of R. If you want to learn even more about base R, take a look at Mike Marin’s R playlist on YouTube.

2.4 Why R?

Why R? Some points taken from adv-r: intro:

  • Free, platform-wide
  • Open source
  • Comprehensive set of “add on” packages for analysis
  • Huge community

Alternatives exist for data analysis, python being another excellent tool, especially these days as it seems like more and more R-like functionality is added to it. The good thing about python is that it’s faster and has better support for machine learning models. For the sake of streamlining, both STAT 545A and STAT 547M only focus on R.

2.5 Orientation to R

2.5.1 Using R and RStudio (5 min)

Let’s try these exercises as our first steps.

  1. Try some arithmetic from a script vs. the console.
    • Notice that your commands appear in the “History” tab. Do not rely on this! What do you think is better than relying on the history?
  2. Store a number in a variable called number using <- (read this arrow as “gets”).
    • Notice that the object appears in the “Environment” tab in the top-right of RStudio.
  3. Try some arithmetic on the variable.
  4. Try some arithmetic on an undefined variable.
  5. Try some arithmetic on the variable on a line of code above the variable definition (do you think we’ll get an error?)

2.5.2 Vectors (3 min)

Vectors store multiple entries of a data type, like numbers. You’ll discover that they show up just about everywhere in R.

Let’s collect some data and store this in a vector called times. How long was your commute this morning, in minutes? Here’s starter code:

times <- c()

Operations happen component-wise. Let’s calculate those times in hours. How can we “save” the results?

2.5.3 Functions, Part I (3 min)

What’s the average travel time? Instead of computing this manually, let’s use a function called mean. Notice the syntax of using a function: the input goes inside brackets, which is followed by the function name to the left.

We input times, and got some output. Did this function change the input? Aside from some bizarre functions, this is always the case. Functions don’t always return a single value. Try the range() function, for example. What’s the output? What about the sqrt() function?

Much of R is about becoming familiar with R’s “vocabulary”. A nice list can be found in Advanced R - Vocabulary.

2.5.4 Comparisons (7 min)

We’ll now introduce logicals.

Which of our travel times are less than (say) 30 minutes? Use <.

Which of our travel times are equal to … (pick something)? What about not equal to it? Notice the use of == as opposed to = – why do you think that is?

Which of our travel times are greater than …(lower)… and less than …(upper)…? What about less than …(lower)… or greater than …(upper)…?

Some functions expect logical inputs. Try using the which() function on one of the above. What about any()? all()?

Logicals can be explicitly specified in R with TRUE and FALSE.

2.5.5 Subsetting (10 min)

Use [] to subset the vector of times:

  1. Extract the third entry.
  2. Extract everything except the third entry.
  3. Extract the second and fourth entry. The fourth and second entry.
  4. Extract the second through fifth entry – make use of : to construct sequential vectors.
  5. Extract all entries that are less than 30 minutes. Why does this work? Logical subsetting!

After all of that, did our times object change at all?

We can use [] in conjunction with <- to change the times object:

  1. Replace two entries with new travel times.
  2. “Cap” entries that are “too large” at some set value. If this is more than one value, why don’t we need to match the number of values? Recycling!
  3. Remove an entry, by overwriting times.

2.5.6 NA (2 min)

Sometimes we have missing data. Those entries are replaced with NA in R. Be careful with these!

  1. Add NA to the vector of times.
  2. What’s the mean of this new vector of times?

Let’s expand our view of functions in order to solve this problem.

2.5.7 Functions, Part II (10 min)

Functions often take more than one arguments as input, separated by commas. You can find out what these arguments are by accessing the function’s documentation:

Access the documentation of the mean() function by executing ?mean.

  • There are four arguments.
  • All the arguments have names, except for the ... argument (more on ... later). This is always the case.
  • Under “Usage”, some of the arguments are of the form name = value.
    • These are default values, in case you don’t specify these arguments.
    • This is a sure sign that these arguments are optional.
  • x is “on its own”. This typically means that it has no default, and often (but not always) means that the argument is required.

We can specify an argument in one of two ways:

  • specifying argument name = value in the function parentheses; or
  • matching the ordering of the input with the ordering of the arguments.
    • For readability, this is not recommended beyond the first or sometimes second argument!

Input TRUE for the na.rm argument in both ways.

2.5.8 Data frames (12 min)

Living in a vector-only world would be nice if all data analyses involved one variable. When we have more than one variable, data frames come to the rescue. Basically, a data frame holds data in tabular format.

R has some data frames “built in”. For example, motor car data is attached to the variable name mtcars.

Print mtcars to screen. Notice the tabular format.

Your turn (5 min): Finish the exercises of this section:

  1. Use some of these built-in R functions to explore mtcars, without printing the whole thing to screen:
    • head(), tail(), str(), nrow(), ncol(), summary(), row.names() (yuck), names().

Notice that names and row.names() outputs a character vector (we’ve already seen numeric and logical vectors). These are useful for characterizing categorical data in R.

  1. What’s the first column name in the mtcars dataset?
  2. Which column number is named "wt"?

Each column is its own vector that can be extracted using $. For example, we can extract the cyl column with mtcars$cyl.

  1. Extract the vector of mpg values. What’s the mean mpg of all cars in the dataset?

2.5.9 R packages (13 min)

Usually, the suite of functions that “come with” R are just not enough to do an analysis.

Usually, the suite of functions that “come with” R are just not very convenient.

In come R packages to the rescue. These are “add ons”, each coming with their own suite of functions and objects, usually designed to do one type of task. CRAN stores packages that, for all intents and purposes, can be considered “official” R packages. It’s easy to install packages from CRAN! Just use the install.packages() function.

Run the following lines of code to install the tibble and gapminder packages. (But don’t include this in your scripts – it’s not very nice to others!)

install.packages("tibble")
install.packages("gapminder")
  • tibble: a data frame with some useful “bells and whistles”
  • gapminder: a package that makes the gapminder dataset available (as a tibble!)

Installing a package is not enough! To access its functions, you have to load it. Use the library() function to load a package. (Note: ironically, it’s not libraries we load with the library() function, but a package).

Run the following lines of code to load the packages. (Do put these in your scripts, and near the top)

library(tibble)
library(gapminder)

Take a look at the packages under the “Global Environment” tab to see the new objects that have just been made available to us. PS: you’ll notice mtcars is not in our workspace/environment, yet we can still access it – where does mtcars live?

Try the following two approaches to access information about the tibble package. Run the lines one-at-a-time. Vignettes are your friend, but do not always exist.

?tibble
browseVignettes(package = "tibble")

Print out the gapminder object to screen. It’s a tibble – how does it differ from a data frame in terms of how it’s printed?

Because a tibble is a data frame, our exploration functions still work on it. Try some.

2.5.10 Two slogans to understand computations in R (6 min)

(We probably won’t have time to cover this, and that’s OK – I’m leaving it here for you to peruse if you are interested).

John Chambers eloquently sums up using R:

To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.

These are useful to remember to prevent us from getting confused.

  1. Everything that exists is an object.

This is not obvious when we look at the output of, say, str():

The stuff you see is simply printed to screen, not an object! The actual object is NULL:

The output of summary() is actually a “table” object (something not often used in R). Let’s coerce it to character data:

  1. Everything that happens is a function call.

Did you know that operators like + are actually functions? The “plus” function is literally `+`(), and accepts two arguments.

Here is what’s actually happening when we call 5 + 2:

Want a challenge? What’s the difference between the `(`() function and the `{`() function? Hint check the documentation with ?`{`.

2.6 Finishing up (5 min)

  1. Highly recommended: Don’t save your workspace when you quit RStudio. Make this a default:
    • Go to “RStudio” -> “Preferences…” -> “General”
    • Uncheck “restore .RData into workspace on startup”
    • Select: “Save workspace to RData on exit:” Never
  2. Push your final script to GitHub (you can do this in a simple way by dragging the file onto your respository homepage).

Don’t forget! There’s an office hour after every class, held upstairs in ESB 3174.