This is meant to be used in RStudio, not CoCalc!
Open your Rproj file
First, open your R Project file (library_carpentry.Rproj) created in the Before We Start lesson.
If you did not complete that step, do the following:
- Under the File menu, click on New project, choose New directory, then New project
- Enter the name library_carpentry for this new folder (or “directory”). This will be your working directory for the rest of the day.
- Click on Create project
- Create a new file where we will type our scripts. Go to File > New File > R script. Click the save icon on your toolbar and save your script as “script.R”.
Presentation of the data
This data was downloaded from the University of Houston–Clear Lake Integrated Library System in 2018. It is a relatively random sample of books from the catalog. It consists of 10,000 observations of 11 variables.
These variables are:
- CALL...BIBLIO. : Bibliographic call number. Most of these are cataloged with the Library of Congress classification, but there are also items cataloged in the Dewey Decimal System (including fiction and non-fiction), and Superintendent of Documents call numbers. Character.
- X245.ab : The title and remainder of title. Exported from MARC tag 245 ab fields. Separated by a pipe character. Character.
- X245.c : The author (statement of responsibility). Exported from MARC tag 245 c. Character.
- TOT.CHKOUT : The total number of checkouts. Integer.
- LOUTDATE : The last date the item was checked out. Date. YYYY-MM-DDThh:mmTZD
- SUBJECT : Bibliographic subject in Library of Congress Subject Headings. Separated by a pipe character. Character.
- ISN : ISBN or ISSN. Exported from MARC field 020 a. Character.
- CALL...ITEM : Item call number. Most of these are NA, but there are some secondary call numbers.
- X008.Date.One : Date of publication. Date. YYYY
- BCODE2 : Item format. Character.
- BCODE1 : Sub-collection. Character.
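Two of the formats described above, the pipe-separated fields and the LOUTDATE timestamp, can be unpacked with base R. This is a minimal sketch; the sample values are invented for illustration and are not rows from the dataset.

```r
# Hypothetical values, invented for illustration; not taken from books.csv.
subject <- "Libraries|Information science|Cataloging"
last_checkout <- "2018-03-05T14:30-06:00"

# strsplit() treats its pattern as a regular expression by default, and "|"
# is a regex metacharacter, so fixed = TRUE is needed to split on a literal pipe.
subject_terms <- strsplit(subject, "|", fixed = TRUE)[[1]]
print(subject_terms)

# The first ten characters of the timestamp are the YYYY-MM-DD date.
checkout_date <- as.Date(substr(last_checkout, 1, 10))
print(checkout_date)
```

Keeping only the date part sidesteps time zone handling, which may or may not matter for your analysis.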
Getting data into R
Ways to get data into R
In order to use your data in R, you must import it and turn it into an R object. There are many ways to get data into R.
- Manually: You can manually create it using the data.frame() function in Base R, or the tibble() function in the tidyverse.
- Import it from a file: Below is a very incomplete list.
  - Text: TXT (readLines() function)
  - Tabular data: CSV, TSV (read.table() function or readr package)
  - Excel: XLSX (xlsx package)
  - Google sheets: (googlesheets package)
  - Statistics program: SPSS, SAS (haven package)
  - Databases: MySQL (RMySQL package)
- Gather it from the web: You can connect to webpages, servers, or APIs directly from within R, or you can create a dataset scraped from HTML webpages using the rvest package. For example:
  - the Twitter API with twitteR
  - Crossref data with rcrossref
  - World Bank’s World Development Indicators with WDI.
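The first two routes above (creating data manually and importing it from a file) can be sketched with base R alone. The column names and values here are made up for illustration, and the file is written to a temporary location so that nothing in your project is touched.

```r
# Create a small data frame by hand (values are made up for illustration).
books_manual <- data.frame(
  title = c("Example title A", "Example title B"),
  checkouts = c(3L, 0L),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file, then import it again with read.csv().
tmp_csv <- tempfile(fileext = ".csv")
write.csv(books_manual, tmp_csv, row.names = FALSE)
books_imported <- read.csv(tmp_csv, stringsAsFactors = FALSE)

# str() shows the column types R chose on import.
str(books_imported)

# readLines() returns the same file as plain text, one element per line.
raw_lines <- readLines(tmp_csv)
print(raw_lines)
```

read.csv() is a wrapper around read.table() with defaults suited to comma-separated files.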
Organizing your working directory
Using a consistent folder structure across your projects will help keep things organized and make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you might create directories (folders) for scripts, data, and documents. Here are some examples of suggested directories:
- data/ Use this folder to store your raw data and intermediate datasets. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible.
- data_output/ When you need to modify your raw data, it might be useful to store the modified versions of the datasets in a different folder.
- documents/ Used for outlines, drafts, and other text.
- fig_output/ This folder can store the graphics that are generated by your scripts.
- scripts/ A place to keep your R scripts for different analyses or plotting.
You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.
The working directory
The working directory is an important concept to understand. It is the place on your computer where R will look for and save files. When you write code for your project, your scripts should refer to files in relation to the root of your working directory and only to files within this structure.
Using RStudio projects makes this easy and ensures that your working directory is set up properly. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser to where your working directory should be, clicking on the blue gear icon “More”, and selecting “Set As Working Directory”. Alternatively, you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line, because it will fail on someone else’s computer.
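As a minimal sketch of referring to files relative to the working directory, the snippet below checks where R is currently looking and builds a relative path. The path data/books.csv is the file used later in this lesson; it may not exist yet when you run this.

```r
# Show the current working directory (in an RStudio project, the project folder).
wd <- getwd()
print(wd)

# Build a path relative to the working directory instead of hard-coding
# an absolute path. file.path() joins the parts with "/".
books_path <- file.path("data", "books.csv")
print(books_path)

# file.exists() reports whether R can find the file from the working directory.
print(file.exists(books_path))
```

Because the path is relative, the same script should work on another computer as long as the project folder has the same layout.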
Setting your working directory with setwd()
Some points to note about setting your working directory:
- The directory must be in quotation marks.
- On Windows computers, directories in file paths are separated with a backslash \. In an R string, however, a single backslash starts an escape sequence, so use a forward slash / instead (or double each backslash, \\). You can copy and paste the path from the Windows Explorer window directly into R and use find/replace (Ctrl/Cmd + F) in RStudio to replace all backslashes with forward slashes.
- On Mac computers, open the Finder and navigate to the directory you wish to set as your working directory. Right-click on that folder and hold the Option key on your keyboard. The ‘Copy “Folder Name”’ option will transform into ‘Copy “Folder Name” as Pathname’. Selecting it copies the path to the folder to the clipboard. You can then paste this into your setwd() function. You do not need to replace backslashes with forward slashes.
- After you set your working directory, you can use ./ to represent it. So if you have a folder in your working directory called data, you can use "./data" to refer to that sub-directory, for example read.csv("./data/books.csv").
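The backslash replacement described above can also be done in code rather than with find/replace. This is a sketch only; the Windows path is hypothetical.

```r
# A hypothetical Windows path. In an ordinary R string each backslash
# has to be doubled to stand for one literal backslash.
win_path <- "C:\\Users\\me\\library_carpentry"

# Replace every literal backslash with a forward slash.
# fixed = TRUE makes gsub() match the backslash literally, not as a regex.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)
```

The resulting string is in the form that setwd() and the file-reading functions accept on Windows.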
Downloading the data and getting set up
Now that you have set your working directory, we will create our folder structure using the dir.create() function.
For this lesson we will use the following folders in our working directory: data/, data_output/ and fig_output/. Let’s write them all in lowercase to be consistent. We can create them using the RStudio interface by clicking on the “New Folder” button in the file pane (bottom right), or directly from R by typing at the console:
dir.create("data")
dir.create("data_output")
dir.create("fig_output")
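If you run those lines a second time, dir.create() warns that the folders already exist. A variation that can be re-run safely, offered as a sketch and not as part of the original lesson, checks for each folder first:

```r
# The three folders used in this lesson.
folders <- c("data", "data_output", "fig_output")

# Create each folder only if it is not already there.
for (folder in folders) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Confirm that all three folders now exist in the working directory.
print(dir.exists(folders))
```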
Go to the Figshare page for this curriculum and download the dataset called “books.csv”. The direct download link is: https://ndownloader.figshare.com/files/22031487. Place this downloaded file in the data/ folder you just created. Alternatively, you can do this directly from R by copying and pasting the following into your R console:
download.file("https://ndownloader.figshare.com/files/22031487",
"data/books.csv", mode = "wb")
Now if you navigate to your data folder, the books.csv file should be there. We now need to load it into our R session.
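You can also confirm from R that the file landed where expected. This is a minimal sketch assuming the file was saved as data/books.csv, as in the download step above; it only checks for the file and does not download anything.

```r
books_file <- file.path("data", "books.csv")

if (file.exists(books_file)) {
  # file.size() returns the size in bytes; a size of 0 suggests a failed download.
  message("Found ", books_file, " (", file.size(books_file), " bytes)")
} else {
  message("Could not find ", books_file,
          "; check the download step and your working directory.")
}
```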