Class Meeting 19 (5) Automate tasks and pipelines I

100% complete

19.1 Today’s Agenda

  • Announcements: (10 mins)
    • COVID-19 precautions
    • Assignment 1 & 2 solutions will be posted tonight
    • Assignment 3 is released (Peer review)
    • Milestone 3 is released
    • In the (currently very unlikely) event of UBC’s closure
  • Part 1: Odds and ends (5 mins)
    • Loading/Saving model objets in R
    • Running knitr from command line
  • Part 2: Run an analysis from start to finish (15 mins)
    • Clone the demo_project repo
    • Run each script individually
    • Delete files and run each script
  • Part 3: Building a data analysis pipeline using MAKE (45 mins)
    • Make targets
    • Make clean
    • Make all

19.2 Part 1: Odds and ends

19.2.1 Knitting files from the command line as a script

You can knit an Rmd file by clicking th “Knit” button in RStudio or by running the following command in an R console:

rmarkdown::render('docs/finalreport.Rmd', 
                   c("html_document", "pdf_document"))

Notice that the first argument is the file that should be knit, and the second argument is a vector of the output types: html and pdf. Though it seems silly to just have a single line of code as a script, there are many more options available when knitting an Rmd file. You’ll see more of these options in Thursday’s lecture (cm107)

19.2.2 Saving and loading R objects

The .rds and .rda file formats are special files that permit you to save single (rds) or multiple R objects (rda).

saveRDS(object, file = "filename.rds")

This file can be the output of a script (for instance, your analysis script). To read the object back into R, simply use readRDS:

readRDS(file = "filename.rds")

To save multiple R objects, specify them with commas:

save(data1, data2, file = "data1and2.rda")

and then to load it again,

load("data1and2.rda")

The above material was from this documentation.

19.3 Part 2: Run an analysis from start to finish

Building a data analysis pipeline using the UNIX shell

To illustrate how to Make a data analysis pipeline, we are going to work through an example together. First, you will have to clone the demo project locally here. It contains several files, but the most relevant are the ones in the docs directory and the src (scripts) directory:

  • load.R: Loads the data file into a CSV.

  • clean.R: Cleans the data.

  • eda.R: Performs exploratory data analysis.

  • knit.R: Knits the .Rmd file into pdf and html files.

  • finalreport.Rmd : template of the final report (technically Rmd files are also scipt)

All those scripts are linked, they use each others’ outputs as their inputs. can say that those scripts implement a data analysis pipeline.

  1. Load the data from a URL and save it to a location on your computer
  2. Save the dataset as a simple csv for later loading
  3. Clean and re-save the data
  4. Create the EDA and generate the corresponding plots
  5. Do the modelling (to be done in milestone3!)
  6. Generate the final report (to be done in milestone3!)

19.4 Part 3: Building a data analysis pipeling using MAKE

19.4.1 Test to ensure you have make installed

  • Open a new Terminal window
  • Type in make and press enter
  • If you have make installed, it should show something like: make: *** No targets specified and no makefile found. Stop.
  • If you do not have it installed, it will say something like “Command not found”

Notes for Windows Users

  • If you do not have MAKE installed (should not be a problem on Windows or macOS), DO NOT spend class time trying to install it, it’s better you try to work through the class and understand how it works. We can try and help you during office hours to get it installed.

These instructions do not apply to Mac or Linux machines!

  1. Go to the Make for Windows web site.
  2. Download the Setup program.
  3. Install the file you just downloaded and copy to your clipboard the directory in which it is being installed.
  4. FYI: The default directory is C:Files (x86)32
  5. You now have make installed, but you need to tell Windows where to find the program. 1. This is called updating your PATH. You will want to update the PATH to include the bin directory of the newly installed program.

Here is how to update your path:

If you installed Make for Windows (as opposed to the make that comes with Git for Windows), you still need to update your PATH.

These are the steps on Windows 7 (we don’t have such a write-up yet for Windows 8 – feel free to send one!):

  • Click on the Windows logo.
  • Right click on Computer.
  • Select Properties.
  • Select Advanced System Settings.
  • Select Environment variables.
  • Select the line that has the PATH variable. You may have to scroll down to find it.
  • Select Edit.
  • Go to the end of the line and add a semicolon ;, followed by the path where the program was installed, followed by .
  • Typical example of what one might add: ;C:Files (x86)32
  • Click Okay and close all the windows that you opened.
  • Quit RStudio and open it again.
  • You should now be able to use make from RStudio and the command line.

19.4.2 Introduction to MAKE

GNU Make is a tool that controls the execution of files. In order to explain to Make how to run our data analysis pipeline, we need the create a new file called Makefile (no file extension).

Each Makefile is made of blocks of code, called rules. A rule follows the following structure :

file_to_create.png : data_it_depends_on.csv script_it_depends_on.R
      action
  • file_to_create.png is a “target”: it is the file that we want to output/create. If there are multiple targets, you just have to seperate the different target names by a space:
file_to_create.png file2.png : data_it_depends_on.csv script_it_depends_on.R
      action
  • data_it_depends_on.csv and script_it_depends_on.R are the dependencies : they are the scripts/files needed to build or update the target. There can be zero or more dependencies.

Note : The target and the dependencies are separated by :.

  • action is an action : it is the command to run in order to build/update the target using the dependencies. You MUST use a TAB to indent an action.

Let’s create our first Makefile now!

19.4.3 Create a Makefile

Open RStudio. Select File > New File > Text File. Save this file with the name Makefile in your project directory. This is very important as by default, Make looks for the file called Makefile. Now, we are going to add several rules to this Makefile.

Let’s add our first rule to this Makefile. Try to add the following code to your file :

# author: Your name
# date: The date

# download data
autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"
  • download_data is the “name” of the target you want to create (this should be closely related to the action that gets triggered).

  • src/load.R is the script that we want to execute.

  • Rscript src/load.R --data_filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff" is the action that we want to do : it is the command to run in order to build the target using the dependencies.

19.4.3.1 Don’t forget to use a TAB to indent the action!!

It’s worth it for us to pause here and make sure that the usage of TAB is emphasized. By default, RStudio replaces the “TAB” character with two spaces. Once you’ve saved the file with the name Makefile, RStudio should indent with tabs instead of spaces. I recommend you display whitespace in order to visually confirm this: RStudio > Preferences > Code > Display > Display whitespace characters.

19.4.3.2 Return to the Makefile

Now, make sure that you save this file in the root directory of the repository. Once this is done, let’s try to run Make on the target that we created: “download_data”. To do so, you have to run the following command in your Terminal :

make autism.csv

Make then runs the script in the action you created. Note that by default, the first target is run if you just type in make in a directory with a Makefile.

If you see this kind of error:

Makefile:3: *** missing separator.  Stop.

check the syntax, make sure the colon is in the right place, script names and directories are correct, and finally check that you used the TAB key to indent the action.

If you try to run Make again, without changing anything, Make will tell you that everything is up to date. This is one of the reasons why Make is more efficient than running individual scripts one by one : before running a script, Make checks the ‘last modification time’ of both the target and its dependencies. If any of the dependencies has been updated after the target was generated for the last time, Make will run the action. In other words, if the target already exists and the dependencies have not been modified since the creation of the target, Make will not run the script again! This is why Make is faster than running multiple scripts over and over again.

Another advantage of Make is that it allows us to remove all the intermediate data files that have been created by our data analysis pipeline, with only one command. This can be useful if we want to run our analysis from scratch. To do so, we have to create a new target called clean,. As the actions for this target, it should deletes all files generated from our script. In this case, the clean target should not have any dependencies. Here is how to create that target :

# author: Your name
# date: The date

# download data
autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"

clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html

Note that you can even have multiple actions if you put one one each line! The -f flag in the rm command ignores nonexistent files and arguments, and does not prompt you to confirm (see the manual for more flags)

Now, try to delete the data/autism.csv file as well as all files (* character) in the images folder using `Make - be careful using this!.

make clean

If you take a look at your data/autism.csv file, you will see it is gone, and all files in the images folder will be gone.

19.4.4 Your turn to finish the Makefile for the demo_project

Try to complete the Makefile that we created in order to execute the whole data analysis pipeline.

# author: Your name
# date: The date

# download data
data/autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"

# clean data
## YOUR SOLUTION HERE
data/autism_cleaned.csv : src/clean.R data/autism.csv
<TAB>Rscript src/clean.R --filepath_raw="data/autism.csv" --filepath_cleaned="data/autism_cleaned.csv"

# EDA
## YOUR SOLUTION HERE
images/barplot.png images/correlation.png images/propbarplot.png : src/eda.R data/autism_cleaned.csv
<TAB>Rscript src/eda.R --filepath_cleaned="data/autism_cleaned.csv"

# Knit report
## YOUR SOLUTION HERE
docs/finalreport.html docs/finalreport.pdf : images/barplot.png images/correlation.png images/propbarplot.png docs/finalreport.Rmd data/autism_cleaned.csv src/knit.R docs/asd_refs.bib
<TAB>Rscript src/knit.R --finalreport="docs/finalreport.Rmd"
    
clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html

At this point, the major part of the Makefile is done! There is just one last thing we have to do.

Now that we have several targets, if we want to run the Makefile, we have to specify to Make which target we are interrested in. For instance, if I just want to update/create the figure images/correlation.png, I would run in my console:

$ make images/correlation.png

But several times, the final output of you analysis is made of several files/figures, which do not always depend on each other… so we would have to call make on each one of those files/figures. We would have to run multiple make command to make sure we capture everything and all the dependencies. This is prone to errors.

This is why we are going to use another target all. all is going to be a target, and its dependencies will be the final outputs of our data analysis pipeline. There will be no action associated to this target.

If we come back to our example, the all target would be the following :

all: docs/finalreport.html docs/finalreport.pdf

We can see that all has only one dependency because all the other plots/files are already dependencies of previous targets.

19.4.5 “Phony” targets

The .PHONY line is where you declare which targets are “phony” i.e. are not actual files to be made in the literal sense. Both all and clean are phony targers. It’s a good idea to explicitly tell make which targets are phony, instead of letting it try to deduce this. make can get confused if you create a file that has the same name as a phony target. If for example you create a directory named clean to hold your clean data and run make clean, then make will report clean is up to date, because a directory with that name already exists.

To take care of this, we just need to tell our Makefile (at the top of the file) which of the targets are “phony”:

.PHONY: all clean

Then, the final Makefile is :

# author: Your name
# date: The date

.PHONY: all clean

## YOUR SOLUTION HERE FOR THE `all` target
all: docs/finalreport.html docs/finalreport.pdf

# download data
data/autism.csv : src/load.R
<TAB>Rscript src/load.R --filepath="data/autism.csv" --url_to_read="https://github.com/STAT547-UBC-2019-20/data_sets/raw/master/Autism-Adult-Data.arff"

# clean data
## YOUR SOLUTION HERE
data/autism_cleaned.csv : src/clean.R data/autism.csv
<TAB>Rscript src/clean.R --filepath_raw="data/autism.csv" --filepath_cleaned="data/autism_cleaned.csv"

# EDA
## YOUR SOLUTION HERE
images/barplot.png images/correlation.png images/propbarplot.png : src/eda.R data/autism_cleaned.csv
<TAB>Rscript src/eda.R --filepath_cleaned="data/autism_cleaned.csv"

# Knit report
## YOUR SOLUTION HERE
docs/finalreport.html docs/finalreport.pdf : images/barplot.png images/correlation.png images/propbarplot.png docs/finalreport.Rmd data/autism_cleaned.csv src/knit.R docs/asd_refs.bib
<TAB>Rscript src/knit.R --finalreport="docs/finalreport.Rmd"
    
clean :
<TAB>rm -f data/*
<TAB>rm -f images/*
<TAB>rm -f docs/*.md
<TAB>rm -f docs/*.html

Now, you just have to run make all on your Terminal, and it will run the whole data analysis pipeline from beginning to end! And if you want to clear all the outputs of your different scripts, you just have to run make clear.

This is it! Now you know how to build an efficient data analysis pipeline from beginning to end!

19.5 Additional Resources