Class Meeting 6 Intro to data wrangling, Part I
Worksheet: You can find a worksheet template for today here.
6.1 Today’s Lessons
Today we’ll introduce the dplyr package. Specifically, we’ll look at these three lessons:
- Intro to dplyr syntax
- The dplyr advantage
- Relational/comparison and logical operators in R
6.2 Resources
All three of today’s lessons are closely aligned with stat545: dplyr-intro.
More detail can be found in the r4ds: transform chapter, up to and including the select() section. Section 5.2 of r4ds also elaborates on relational/comparison and logical operators in R.
Here are some supplementary resources:
- A similar resource to the r4ds one above is the intro to dplyr vignette, up until and including the select() section.
- Want to read more about piping? See r4ds: pipes.
6.3 Participation
To get participation points for today, we’ll be filling out the cm006-exercise.Rmd file, and adding it to your participation repo.
6.4 Intro to dplyr syntax
6.4.1 Learning Objectives
Here are the concepts we’ll be exploring in this lesson:
- tidyverse
- dplyr functions:
  - select
  - arrange
- piping
By the end of this lesson, students are expected to be able to:
- subset and rearrange data with dplyr
- use piping (%>%) when implementing function chains
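As a minimal sketch of these objectives, the block below subsets and reorders the gapminder data twice: once step by step, and once as a pipe chain. It assumes the dplyr and gapminder packages are installed; the object names gap_small, gap_sorted, and gap_piped are made up for illustration.

```r
library(dplyr)
library(gapminder)

# select() keeps only the named columns
gap_small <- select(gapminder, country, year, lifeExp)

# arrange() reorders rows; desc() sorts from highest to lowest
gap_sorted <- arrange(gap_small, desc(lifeExp))

# The same two steps written as a pipe chain:
# each %>% passes the left-hand result in as the first argument
gap_piped <- gapminder %>%
  select(country, year, lifeExp) %>%
  arrange(desc(lifeExp))

# Both routes should give the same table
all.equal(as.data.frame(gap_sorted), as.data.frame(gap_piped))
```

The piped version reads top to bottom in the order the steps happen, and it avoids naming an intermediate object that is never used again.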
6.4.2 Preamble
Let’s talk about:
- The history of dplyr: plyr
- tibbles are a special type of data frame
- the tidyverse
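To make the point about tibbles concrete, here is a small sketch that converts the built-in iris data frame to a tibble and checks what it is. It assumes the tibble package is installed (it comes with the tidyverse); iris_tbl is a name made up for illustration.

```r
library(tibble)

# iris ships with R as an ordinary data frame
iris_tbl <- as_tibble(iris)

is.data.frame(iris_tbl)  # a tibble is still a data frame
class(iris_tbl)          # with extra classes layered on top

# One practical difference: single-column [ ] subsetting
class(iris[, "Species"])      # a base data frame drops to a vector (here a factor)
class(iris_tbl[, "Species"])  # a tibble stays a tibble
```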
6.4.3 Demonstration
Let’s get started with the exercise:
- Open RStudio, and install the tidyverse meta-package by executing install.packages("tidyverse") in the R console.
- Optional: open the STAT545_participation RStudio project in RStudio.
- With RStudio, open the cm006-exercise.Rmd file you downloaded and committed earlier.
- Follow the instructions in the .Rmd file until the resume lecture section.
6.5 Small break
Here are some things you might choose to do on this break:
- Talk with a TA, Vincenzo, or your neighbour(s) about the content so far.
- Attempt the bonus exercises in the cm006-exercise.Rmd file.
- Work on an assignment.
6.6 The dplyr advantage
6.6.1 Learning Objectives
By the end of this lesson, students are expected to be able to:
- Have a sense of why dplyr is advantageous compared to the “base R” way with respect to good coding practice.
Why?
- Having this in the back of your mind will help you identify the qualities of a readable analysis, and produce one yourself.
6.6.2 Compare base R to dplyr
Self-documenting code.
This is where the tidyverse shines.
Example of dplyr vs base R:
gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)
vs.
gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]
No need to take excerpts (that is, to store subsets of the data as separate objects).
Wrangle with dplyr first, then pipe into a plot/analysis.
Or, use the subset argument that’s often offered by R functions like lm().
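Both routes can be sketched with a simple linear model of life expectancy over time for Cambodia. This assumes dplyr and gapminder are installed; the model formula and the names fit_pipe and fit_subset are illustrative choices, not part of the worksheet.

```r
library(dplyr)
library(gapminder)

# Option 1: wrangle first, then pipe into the analysis.
# lm() takes the formula first, so the piped data is placed with `data = .`
fit_pipe <- gapminder %>%
  filter(country == "Cambodia") %>%
  lm(lifeExp ~ year, data = .)

# Option 2: let lm() do the subsetting through its subset argument
fit_subset <- lm(lifeExp ~ year,
                 data = gapminder,
                 subset = country == "Cambodia")

# The two fits should have the same coefficients
all.equal(coef(fit_pipe), coef(fit_subset))
```

Neither version leaves a stray Cambodia-only data frame in the workspace.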
Especially don’t use magic numbers to subset!
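A short sketch of why magic numbers are fragile, assuming dplyr and gapminder are installed. The names by_position, by_name, and reordered are made up for illustration.

```r
library(dplyr)
library(gapminder)

# Magic numbers: a reader cannot tell what columns 3 and 4 are
by_position <- gapminder[, c(3, 4)]

# Names state what is being kept
by_name <- gapminder %>% select(year, lifeExp)

# If the column order changes upstream, positions silently point elsewhere
reordered <- gapminder %>% select(pop, everything())
names(reordered[, c(3, 4)])                 # no longer year and lifeExp
names(reordered %>% select(year, lifeExp))  # still year and lifeExp
```

The positional version keeps running after the reordering and returns the wrong columns without any warning, which is what makes it risky.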
Note that you need to use the assignment operator to store changes!
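For instance, dplyr functions return a new table rather than modifying their input. A small sketch, assuming dplyr and gapminder are installed (cambodia is an illustrative name):

```r
library(dplyr)
library(gapminder)

# Without assignment, the result is printed and then discarded
gapminder %>% filter(country == "Cambodia")
nrow(gapminder)  # gapminder itself is unchanged

# With assignment, the filtered table is stored under a name
cambodia <- gapminder %>% filter(country == "Cambodia")
nrow(cambodia)
```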
6.7 Relational/Comparison and Logical Operators in R
6.7.1 Learning Objectives
Here are the concepts we’ll be exploring in this lesson:
- Relational/Comparison operators
- Logical operators
- dplyr functions:
  - filter
  - mutate
By the end of this lesson, students are expected to be able to:
- Predict the output of R code containing the above operators.
- Explain the difference between &/&& and |/||, and name a situation where one should be used over the other.
- Subset and transform data using filter and mutate.
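As a rough preview of the last objective, the sketch below combines a relational and a logical operator inside filter(), then adds a column with mutate(). It assumes dplyr and gapminder are installed; asia_recent and the gdp column are names invented for this example.

```r
library(dplyr)
library(gapminder)

asia_recent <- gapminder %>%
  # keep rows where both conditions hold; & combines them element-wise
  filter(continent == "Asia" & year >= 2000) %>%
  # add a new column computed from existing ones
  mutate(gdp = gdpPercap * pop)

head(asia_recent)
```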
6.7.2 R Operators
Arithmetic operators allow us to carry out mathematical operations:
Operator | Description |
---|---|
+ | Add |
- | Subtract |
* | Multiply |
/ | Divide |
^ | Exponent |
%% | Modulus (remainder from division) |
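A few of these operators in action, using base R only. The vector x is an arbitrary example.

```r
x <- c(1, 2, 3)

x + 10      # 11 12 13: the scalar is recycled across the vector
x * x       # 1 4 9: element-by-element multiplication
2^3         # 8
7 / 2       # 3.5
7 %% 3      # 1
(-7) %% 3   # 2: in R the result takes the sign of the divisor
```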
Relational operators allow us to compare values:
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to |
!= | Not equal to |
- Arithmetic and relational operators work on vectors.
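A brief base R sketch of relational operators applied to a vector; x is an arbitrary example. The last two lines illustrate a common pitfall when testing decimal numbers for equality.

```r
x <- c(1, 5, 10)

x > 4        # FALSE TRUE TRUE
x == 5       # FALSE TRUE FALSE
x != 5       # TRUE FALSE TRUE
sum(x >= 5)  # 2: TRUE counts as 1, FALSE as 0

# Exact equality can surprise with floating-point numbers
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: compares within a small tolerance
```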
Logical operators allow us to carry out boolean operations:
Operator | Description |
---|---|
! | Not |
| | Or (element-wise) |
& | And (element-wise) |
|| | Or (single values; short-circuits) |
&& | And (single values; short-circuits) |
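- The difference between | and || is that | evaluates element-wise and returns a vector, whereas || expects single (length-one) values, returns a single TRUE or FALSE, and evaluates its right-hand side only when needed. Older versions of R silently used only the first element when || was given longer vectors; as of R 4.3.0 this is an error. The same distinction holds for & and &&.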
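The contrast can be sketched in base R as follows. The vectors a and b and the variable status are arbitrary examples.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise forms return one result per position
a & b   # TRUE FALSE FALSE
a | b   # TRUE TRUE TRUE
!a      # FALSE TRUE FALSE

# The double forms suit if() conditions, which need a single TRUE or FALSE.
# Because || short-circuits, `x > 0` is never evaluated when x is NULL.
x <- NULL
if (is.null(x) || x > 0) {
  status <- "missing or positive"
} else {
  status <- "zero or negative"
}
status
```

As a rule of thumb, use & and | inside filter() and mutate(), where a whole column is tested at once, and && and || in control flow such as if().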
6.7.3 Demonstration
Continue along with the cm006-exercise.Rmd file.
6.8 If there’s time remaining
- Let’s do the bonus exercises together, in the cm006-exercise.Rmd file.
- Another “break”