Class Meeting 11 File input/output (I/O)
Today’s class is all about reading in and writing out data files into and out of R.
11.1 Worksheet
Normally there would be a separate worksheet file, but we’ll do everything in-line for this class
11.2 Resources
11.2.1 References and tutorials
11.2.2 Package documentation
11.3 Writing data to disk (10 mins)
Let’s first load the built-in gapminder dataset and the tidyverse:
Next, let’s filter the data only from 2007 and only in the Asia continent and save it to a variable.
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 India Asia 2007 64.7 1110396331 2452.
## 8 Indonesia Asia 2007 70.6 223547000 3541.
## 9 Iran Asia 2007 71.0 69453570 11606.
## 10 Iraq Asia 2007 59.5 27499638 4471.
## # … with 23 more rows
We can write this to a comma-separated value (csv) file with just one command:
write_csv(gap_asia_2007,"exported_file.csv")
But let’s not just write the csv file anywhere, we should download the file in a sensible location. The next section talks about the here
package
11.4 Making the case for here::here()
If you wanted to make your Rproj more portable and accessible to more users in a cross-platform (between Mac, Unix, Windows users), rather than specifying every path explicitly, here::here() allows you to set relative paths much more easily.
For example, create a new folder “data” and within it a subfolder called “cm011_data”, specify here() and then save a file to that location with these commands:
11.4.1
More thorough notes to come…
In the meantime, read this short article for an excellent explanation of why we use the here::here() package in R.
11.5 Reading data from disk (5 mins)
The same csv file that we just saved to disk can be imported into R again by specifying the path where it exists:
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_double(),
## lifeExp = col_double(),
## pop = col_double(),
## gdpPercap = col_double()
## )
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bahrain Asia 2007 75.6 708573 29796.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 Cambodia Asia 2007 59.7 14131858 1714.
## 5 China Asia 2007 73.0 1318683096 4959.
## 6 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 7 India Asia 2007 64.7 1110396331 2452.
## 8 Indonesia Asia 2007 70.6 223547000 3541.
## 9 Iran Asia 2007 71.0 69453570 11606.
## 10 Iraq Asia 2007 59.5 27499638 4471.
## # … with 23 more rows
Normally we would store the imported data into a new variable and you can use that by assigning the output to a variable. Notice that the output of the imported file is the same as the original tibble, and read_csv was intelligent enough to detect the types of the columns. This won’t always be true so it’s worth checking! The read_csv package has many additional options including the ability to skip columns, skip rows, rename columns on import, trim whitespace, and more…
11.6 Import a file from the web/cloud
11.6.1 Import a csv file from the internet
To import a csv file from a web, assign the URL to a variable
and then apply read_csv file to the url
.
## Parsed with column specification:
## cols(
## Magazine = col_character(),
## AdRevenue = col_double(),
## AdPages = col_double(),
## SubRevenue = col_double(),
## NewsRevenue = col_double()
## )
## # A tibble: 204 x 5
## Magazine AdRevenue AdPages SubRevenue NewsRevenue
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Weekly World News 2280 300 854 16568
## 2 National Examiner 3382 380 968 27215
## 3 J-14 4218 250 2206 12453
## 4 Soap Opera Weekly 4622 439 5555 24282
## 5 Easyriders 5121 524. 4155 9929
## 6 Mary Engelbreit's Home Companion 5259 189 9048 4363
## 7 Official Xbox Magazine 5838 542. 4311 10320
## 8 Weight Watchers 6986 287. 9202 4048
## 9 Globe 7634 380 2180 63771
## 10 PSM: 100% Independent PlayStation 2… 8034 720. 6846 5271
## # … with 194 more rows
11.6.2 Import an excel file (.xls) from the internet
First, we’ll need the package to load in excel files:
’Datafiles from this tutorial were obtained from: https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file
To import an .xls or .xlsx file from the internet, you first need to download it locally. The read_excel function from the readxl package can help us read it after we download it. To download it, create a new variable called xls_url, as well the name of the destination file you would like to download the data into.
xls_url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls"
download.file(xls_url,here::here("data","cm011_data","some_file.xls"))
NOTE: If you are on windows and end up downloading a corrupt file, you need to add an extra argument: download.file(…, mode=“wb”). More details about this behaviour can be found here.
Naming a file “some_file” is extremely bad practice (hard to keep track of the files) and I would strongly encourage you to name the file similar (or the same) to the original file. Let me show you a handy trick to extract the filename from the URL:
Now we can import the file:
## # A tibble: 50 x 8
## Rank Name Background `2003-07 Given … Causes `Estimated lift… `Net Worth`
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 1 Warr… Berkshire… 40,650 Healt… 40780 52000
## 2 2 Bill… Microsoft… 3,519 Globa… 28144 59000
## 3 3 Geor… Oil and g… 2,271 Pover… 2522 11000
## 4 4 Geor… Investor 2,109 Open … 6401 8800
## 5 5 Gord… Intel co-… 2,067 Envir… 7404 4500
## 6 6 Walt… Family of… 1,475 Educa… 2015 82500
## 7 7 Herb… Golden We… 1,368 Medic… 1389 2400
## 8 8 Eli … SunAmeric… 1,216 Publi… 2286 7000
## 9 9 Dona… Real esta… 915 Educa… 1326 13000
## 10 10 Jon … Huntsman … 800 Cance… 1233 1900
## # … with 40 more rows, and 1 more variable: `Giving%` <chr>
11.7 Read in a sample Excel file. (Optional)
Let’s load in a sample dataset from a PhD research project in MRI (magnetic resonance imaging). In MRI, subjects are imaged and the data collected can be visualized in “slices”. If a human head was being imaged, the first slice might be a cross-section of the neck. The next slice would be an image in the same plane as the first slice, but a few mm above the first slice, and so on and so forth. The exact details of the dataset aren’t important, but here are key details we can extract from the dataset:
- the first column contains the subject ID (e.g., “HerS18Bs01.BS1/8”)
- Each row contains 8 measurements, each from a different slice in the image
- The next column is a weighted average, and the final column is the volume measurement of all 8 slices
- The data is not in tidy format
- The subjects are divided into two treatment groups: “Avastin” and “Herceptin”, but unfortunately, this information is not captured in the table
- The relevant data can be found in the range A1:K12 since the rest of the data contains either footer rows or repeated data.
mri_file = here::here("data","cm011_data","Firas-MRI.xlsx")
mri <- read_excel(mri_file, range = "A1:K12")
Viewing the data using the View()
function will allow you to investigate how the imported data looks in R. As hinted above, it looks like we have to do one preliminary data-processing steps before we can import all the data.
Notice that below row 12, two subsets of the data from the first 12 rows is duplicated into two groups: “Avastin” and “Herceptin” treated. Indeed, the person who did the analysis copied the raw data and split it into the two groups. While it is possible to do this processing in R, in this case it is better that we do this directly in the excel file. Let’s do this and save a new file called Firas-MRI_minor_cleaning.xlsx
Note also that column J is a calculated column of weighted averages. Rather than bring this column in a-is, we want to remove this column and then later calculate it in R if we need to.
We are now ready to import in the newly processed file:
mri_file = here::here("data","cm011_data","Firas-MRI.xlsx")
mri <- read_excel(mri_file, range = "A1:L12")
## New names:
## * `` -> ...12
Finally, let’s make our data tidy using pivot_longer
:
mri <- mri %>%
tidyr::pivot_longer(cols = `Slice 1`:`Slice 8`,
names_to = 'slice_no',
values_to = 'value')
mri
## # A tibble: 88 x 5
## `Animal ID` Volume ...12 slice_no value
## <chr> <dbl> <lgl> <chr> <dbl>
## 1 HerS18Bs01.BS1/8 523. NA Slice 1 11.1
## 2 HerS18Bs01.BS1/8 523. NA Slice 2 14.7
## 3 HerS18Bs01.BS1/8 523. NA Slice 3 16.5
## 4 HerS18Bs01.BS1/8 523. NA Slice 4 14.3
## 5 HerS18Bs01.BS1/8 523. NA Slice 5 18.9
## 6 HerS18Bs01.BS1/8 523. NA Slice 6 22.5
## 7 HerS18Bs01.BS1/8 523. NA Slice 7 19.0
## 8 HerS18Bs01.BS1/8 523. NA Slice 8 18.8
## 9 HerS18Bs02.BS1/7 400. NA Slice 1 10.3
## 10 HerS18Bs02.BS1/7 400. NA Slice 2 7.90
## # … with 78 more rows
And we are done! Ready to explore trends in this dataset.