Data Cleaning & Tranformation
library(fs)
Load the tidyverse
and data frame into your R session
library(tidyverse)
Warning message in system("timedatectl", intern = TRUE):
โrunning command 'timedatectl' had status 1โ
โโ [1mAttaching packages[22m โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ tidyverse 1.3.1 โโ
[32mโ[39m [34mggplot2[39m 3.3.3 [32mโ[39m [34mpurrr [39m 0.3.4
[32mโ[39m [34mtibble [39m 3.1.1 [32mโ[39m [34mdplyr [39m 1.0.5
[32mโ[39m [34mtidyr [39m 1.1.3 [32mโ[39m [34mstringr[39m 1.4.0
[32mโ[39m [34mreadr [39m 1.4.0 [32mโ[39m [34mforcats[39m 0.5.1
โโ [1mConflicts[22m โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ tidyverse_conflicts() โโ
[31mโ[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31mโ[39m [34mdplyr[39m::[32mlag()[39m masks [34mstats[39m::lag()
books <- read_csv("data/books.csv") # load the data and assign it to books
[36mโโ[39m [1m[1mColumn specification[1m[22m [36mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ[39m
cols(
CALL...BIBLIO. = [31mcol_character()[39m,
X245.ab = [31mcol_character()[39m,
X245.c = [31mcol_character()[39m,
LOCATION = [31mcol_character()[39m,
TOT.CHKOUT = [32mcol_double()[39m,
LOUTDATE = [31mcol_character()[39m,
SUBJECT = [31mcol_character()[39m,
ISN = [31mcol_character()[39m,
CALL...ITEM. = [31mcol_character()[39m,
X008.Date.One = [31mcol_character()[39m,
BCODE2 = [31mcol_character()[39m,
BCODE1 = [31mcol_character()[39m
)
Transforming data with dplyr
We are now entering the data cleaning and transforming phase. While it is possible to do much of the following using Base R functions (in other words, without loading an external package) dplyr
makes it much easier. Like many of the most useful R packages, dplyr
was developed by data scientist http://hadley.nz/.
dplyr
is a package for making tabular data manipulation easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr
which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis.
dplyr
is also part of the tidyverse
. Letโs make sure we are all on the same page by loading the tidyverse
and the books
dataset we downloaded earlier.
Weโre going to learn some of the most common dplyr
functions:
rename()
: rename columnsrecode()
: recode values in a columnselect()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columnsgroup_by()
andsummarize()
: create summary statistics on grouped dataarrange()
: sort resultscount()
: count discrete values
Renaming variables
It is often necessary to rename variables to make them more meaningful. If you print the names of the sample books
dataset you can see that some of the vector names are not particularly helpful:
glimpse(books)
Rows: 10,000
Columns: 12
$ CALL...BIBLIO. [3m[90m<chr>[39m[23m "001.94 Don 2000", "001.942 Bro 1999", "027.073 App 200โฆ
$ X245.ab [3m[90m<chr>[39m[23m "Bermuda Triangle /", "Invaders from outer space :|realโฆ
$ X245.c [3m[90m<chr>[39m[23m "written by Andrew Donkin.", "written by Philip Brooks.โฆ
$ LOCATION [3m[90m<chr>[39m[23m "juv", "juv", "juv", "juv", "juv", "juv", "juv", "juv",โฆ
$ TOT.CHKOUT [3m[90m<dbl>[39m[23m 6, 2, 3, 6, 7, 6, 4, 2, 4, 13, 6, 7, 3, 22, 2, 9, 4, 8,โฆ
$ LOUTDATE [3m[90m<chr>[39m[23m "11-21-2013 9:44", "02-07-2004 15:29", "10-16-2007 10:5โฆ
$ SUBJECT [3m[90m<chr>[39m[23m "Readers (Elementary)|Bermuda Triangle -- Juvenile liteโฆ
$ ISN [3m[90m<chr>[39m[23m "0789454165 (hbk.)~0789454157 (pbk.)", "0789439999 (harโฆ
$ CALL...ITEM. [3m[90m<chr>[39m[23m "001.94 Don 2000", "001.942 Bro 1999", "027.073 App 200โฆ
$ X008.Date.One [3m[90m<chr>[39m[23m "2000", "1999", "2001", "1999", "2000", "2001", "2001",โฆ
$ BCODE2 [3m[90m<chr>[39m[23m "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", โฆ
$ BCODE1 [3m[90m<chr>[39m[23m "j", "j", "j", "j", "j", "j", "j", "j", "j", "j", "j", โฆ
There are many ways to rename variables in R, but the rename()
function in the dplyr
package is the easiest and most straightforward. The new variable name comes first. See help(rename)
.
Here we rename the X245.ab variable. Make sure you assign the output to your books
value, otherwise it will just print it to the console. In other words, we are overwritting the previous books
value with the new one, with X245.ab
changed to title
# rename the . Make sure you return (<-) the output to your
# variable, otherwise it will just print it to the console
books <- rename(books,
title = X245.ab)
Side note:
Where does X245.ab come from? That is the MARC field 245 |
ab. However, because R variables cannot start with a number, R automatically inserted and X, and because pipes | are not allowed in variable names, R replaced it with a period. |
# rename multiple variables at once
books <- rename(books,
author = X245.c,
callnumber = CALL...BIBLIO.,
isbn = ISN,
pubyear = X008.Date.One,
subCollection = BCODE1,
format = BCODE2,
location = LOCATION,
tot_chkout = TOT.CHKOUT,
loutdate = LOUTDATE,
subject = SUBJECT)
books
WARNING: Some output was deleted.
Exercise: Rename CALL...ITEM.
- Use rename() to rename the CALLโฆITEM. column to callnumber2. Remember to add the period to the end of the CALLโฆITEM. value
Recoding values
It is often necessary to recode or reclassify values in your data. For example, in the sample dataset provided to you, the sub_collection
(formerly BCODE1
) and format
(formerly BCODE2
) variables contain single characters. You can do this easily using the recode()
function, also in the dplyr
package. Unlike rename()
, the old value comes first here. Also notice that we are overwriting the books$subCollection
variable.
# first print to the console all of the unique values you will need to recode
distinct(books, subCollection)
subCollection |
---|
<chr> |
j |
b |
u |
r |
- |
s |
c |
z |
a |
t |
books$subCollection <- recode(books$subCollection,
"-" = "general collection",
u = "government documents",
r = "reference",
b = "k-12 materials",
j = "juvenile",
s = "special collections",
c = "computer files",
t = "theses",
a = "archives",
z = "reserves")
books
WARNING: Some output was deleted.
Do the same for the format
column. Note that you must put "5"
and "4"
into quotation marks for the function to operate correctly.
books$format <- recode(books$format,
a = "book",
e = "serial",
w = "microform",
s = "e-gov doc",
o = "map",
n = "database",
k = "cd-rom",
m = "image",
"5" = "kit/object",
"4" = "online video")
Subsetting dataframes
Subsetting using filter()
in the dplyr
package
In the last lesson we learned how to subset a data frame using brackets. As with other R functions, the dplyr
package makes it much more straightforward, using the filter()
function.
Here we will create a subset of books
called booksOnly
, which includes only those items where the format is books. Notice that we use two equal signs ==
as the logical operator:
booksOnly <- filter(books, format == "book") # filter books to return only those items where the format is books
You can also use multiple filter conditions. Here, the order matters: first we filter to include only books, then of the results, we include only items that have more than zero checkouts.
bookCheckouts <- filter(books,
format == "book",
tot_chkout > 0)
How many items were removed? You can find out functionally with:
nrow(books) - nrow(bookCheckouts)
5733
You can then check the summary statistics of checkouts for books with more than zero checkouts. Notice how different these numbers are from the previous lesson, when we kept zero in. The median is now 3 and the mean is 5.
summary(bookCheckouts$tot_chkout)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 5.281 6.000 113.000
If you want to filter on multiple conditions within the same variable, use the %in%
operator combined with a vector of all the values you wish to include within c()
. For example, you may want to include only items in the format serial
and microform
:
serial_microform <- filter(books, format %in% c("serial", "microform"))
Exercise: Filtering with filter()
- Use
filter()
to create a data frame calledbooksJuv
consisting offormat
books andsubCollection
juvenile materials. - Use
mean()
to check the average number of checkouts for thebooksJuv
data frame
Selecting variables
The select()
function allows you to keep or remove specific columns. It also provides a convenient way to reorder variables.
# specify the variables you want to keep by name
booksTitleCheckouts <- select(books, title, tot_chkout)
booksTitleCheckouts
title | tot_chkout |
---|---|
<chr> | <dbl> |
Bermuda Triangle / | 6 |
Invaders from outer space :|real-life stories of UFOs / | 2 |
Down Cut Shin Creek :|the pack horse librarians of Kentucky / | 3 |
The Chinese book of animal powers / | 6 |
Judge Judy Sheindlin's Win or lose by how you choose! / | 7 |
Judge Judy Sheindlin's You can't judge a book by its cover :|cool rules for school / | 6 |
A young child's Bible / | 4 |
God's Kingdom :|stories from the New Testament / | 2 |
Roman myths / | 4 |
Greek gods and goddesses / | 13 |
The story of Shabbat / | 6 |
On Shabbat / | 7 |
On Purim / | 3 |
Coming to America :|the story of immigration / | 22 |
The shared heart :|portraits and stories celebrating lesbian, gay, and bisexual young people / | 2 |
Remember the ladies :|100 great American women / | 9 |
One Belfast boy / | 4 |
One boy from Kosovo / | 8 |
Slavery :|bondage throughout history / | 3 |
River town / | 5 |
Kids on strike! / | 9 |
A kids' guide to America's Bill of Rights :|curfews, censorship, and the 100-pound giant / | 4 |
Do you remember the color blue? :|and other questions kids ask about blindness / | 5 |
The wildlife detectives :|how forensic scientists fight crimes against nature / | 2 |
The environmental movement :|from its roots to the challenges of a new century / | 3 |
After the spill :|the Exxon Valdez disaster, then and now / | 5 |
Juvenile crime / | 4 |
One-room school / | 6 |
Reader's choice :|the books children love to read and teachers love to teach. | 1 |
Reader's choice :|the books children love to read and teachers love to teach. | 7 |
โฎ | โฎ |
Lectura Scott Foresman / | 4 |
Employment dispute resolution and worker rights in the changing workplace / | 2 |
The best teacher "stuff" / | 8 |
Rosso in Italy :|the Italian career of Rosso Fiorentino / | 0 |
CL locker keys | 13 |
Many peoples, one land :|a guide to new multicultural literature for children and young adults / | 0 |
Conceptual revolutions / | 0 |
The journals of gerontology. | 0 |
U.S. master tax guide. | 0 |
Central business review. | 0 |
Steamboat! :|the story of Captain Blanche Leathers / | 5 |
Collections.|a Harcourt reading/language arts program. | 20 |
Co-operation in international business :|the use of limited equity arrangements / | 0 |
McGraw-Hill reading. | 4 |
McGraw-Hill lectura / | 2 |
Collections :|a Harcourt reading program / | 0 |
Voz del lector :|los libros que los ninos quieren leer Y los maestros quieren ensenar. | 7 |
Early supplementary feeding and cognition :|effects over two decades / | 0 |
Collections :|a Harcourt reading/language arts program / | 18 |
Lectura Scott Foresman / | 3 |
McGraw-Hill lectura / | 1 |
McGraw-Hill lectura / | 1 |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 0 |
Scott Foresman reading / | 4 |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 6 |
Population today. | 0 |
Practical plays / | 3 |
<span style=white-space:pre-wrap>Science, technology & human values. </span> | 0 |
CL locker keys | 13 |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 2 |
# specify the variables you want to remove with a -
books <- select(books, -location)
# reorder columns, combined with everything()
booksReordered <- select(books, title, tot_chkout, loutdate, everything())
Ordering data
The arrange()
function in the dplyr
package allows you to sort your data by alphabetical or numerical order
booksTitleArrange <- arrange(books, title)
# use desc() to sort a variable in descending order
booksHighestChkout <- arrange(books, desc(tot_chkout))
booksHighestChkout
WARNING: Some output was deleted.
# order data based on multiple variables (e.g. sort first by checkout, then by publication year)
booksChkoutYear <- arrange(books, desc(tot_chkout), desc(pubyear))
Creating new variables
The mutate()
function allows you to create new variables. Here, we use the str_sub()
function from the stringr
package to extract the first character of the callnumber
variable (the call number class) and put it into a new column called call_class
.
booksLC <- mutate(books,
call_class = str_sub(callnumber, 1, 1))
There are two numbers because you must specify a start and an end valueโhere, we start with the first character, and end with the first character.
mutate()
is also helpful to coerce a column from one data type to another. For example, we can see there are some errors in the pubyear
variableโsome dates are 19zz
or uuuu
. As a result, this variable was read in as a character
rather than an integer
.
books <- mutate(books, pubyear = as.integer(pubyear))
Warning message in mask$eval_all_mutate(quo):
โNAs introduced by coercionโ
We see the error message NAs introduced by coercion
. This is because non-numerical variables become NA
and the remainder become integers.
Pattern matching
Cleaning text with the stringr
package is easier when you have a basic understanding of โregexโ, or regular expression pattern matching. Regex is especially useful for manipulating strings (alphanumeric data), and is the backbone of search-and-replace operations in most applications. Pattern matching is common to all programming languages but regex syntax is often code-language specific. Below, find an example of using pattern matching to find and replace data in R:
- Remove the trailing slash in the title column
- Modify the punctuation separating the title from a subtitle
Note: If the final product of this data will be imported into an ILS, you may not want to alter the MARC specific punctuation. All other audiences will appreciate the text normalizing steps.
Read more about matching patterns with regular expressions.
books %>%
mutate(title_modified = str_remove(title, "/$")) %>% # remove the trailing slash
mutate(title_modified = str_replace(title_modified, "\\s:\\|", ": ")) %>% # replace ' :|' with ': '
select(title_modified, title)
title_modified | title |
---|---|
<chr> | <chr> |
Bermuda Triangle | Bermuda Triangle / |
Invaders from outer space: real-life stories of UFOs | Invaders from outer space :|real-life stories of UFOs / |
Down Cut Shin Creek: the pack horse librarians of Kentucky | Down Cut Shin Creek :|the pack horse librarians of Kentucky / |
The Chinese book of animal powers | The Chinese book of animal powers / |
Judge Judy Sheindlin's Win or lose by how you choose! | Judge Judy Sheindlin's Win or lose by how you choose! / |
Judge Judy Sheindlin's You can't judge a book by its cover: cool rules for school | Judge Judy Sheindlin's You can't judge a book by its cover :|cool rules for school / |
A young child's Bible | A young child's Bible / |
God's Kingdom: stories from the New Testament | God's Kingdom :|stories from the New Testament / |
Roman myths | Roman myths / |
Greek gods and goddesses | Greek gods and goddesses / |
The story of Shabbat | The story of Shabbat / |
On Shabbat | On Shabbat / |
On Purim | On Purim / |
Coming to America: the story of immigration | Coming to America :|the story of immigration / |
The shared heart: portraits and stories celebrating lesbian, gay, and bisexual young people | The shared heart :|portraits and stories celebrating lesbian, gay, and bisexual young people / |
Remember the ladies: 100 great American women | Remember the ladies :|100 great American women / |
One Belfast boy | One Belfast boy / |
One boy from Kosovo | One boy from Kosovo / |
Slavery: bondage throughout history | Slavery :|bondage throughout history / |
River town | River town / |
Kids on strike! | Kids on strike! / |
A kids' guide to America's Bill of Rights: curfews, censorship, and the 100-pound giant | A kids' guide to America's Bill of Rights :|curfews, censorship, and the 100-pound giant / |
Do you remember the color blue?: and other questions kids ask about blindness | Do you remember the color blue? :|and other questions kids ask about blindness / |
The wildlife detectives: how forensic scientists fight crimes against nature | The wildlife detectives :|how forensic scientists fight crimes against nature / |
The environmental movement: from its roots to the challenges of a new century | The environmental movement :|from its roots to the challenges of a new century / |
After the spill: the Exxon Valdez disaster, then and now | After the spill :|the Exxon Valdez disaster, then and now / |
Juvenile crime | Juvenile crime / |
One-room school | One-room school / |
Reader's choice: the books children love to read and teachers love to teach. | Reader's choice :|the books children love to read and teachers love to teach. |
Reader's choice: the books children love to read and teachers love to teach. | Reader's choice :|the books children love to read and teachers love to teach. |
โฎ | โฎ |
Lectura Scott Foresman | Lectura Scott Foresman / |
Employment dispute resolution and worker rights in the changing workplace | Employment dispute resolution and worker rights in the changing workplace / |
The best teacher "stuff" | The best teacher "stuff" / |
Rosso in Italy: the Italian career of Rosso Fiorentino | Rosso in Italy :|the Italian career of Rosso Fiorentino / |
CL locker keys | CL locker keys |
Many peoples, one land: a guide to new multicultural literature for children and young adults | Many peoples, one land :|a guide to new multicultural literature for children and young adults / |
Conceptual revolutions | Conceptual revolutions / |
The journals of gerontology. | The journals of gerontology. |
U.S. master tax guide. | U.S. master tax guide. |
Central business review. | Central business review. |
Steamboat!: the story of Captain Blanche Leathers | Steamboat! :|the story of Captain Blanche Leathers / |
Collections.|a Harcourt reading/language arts program. | Collections.|a Harcourt reading/language arts program. |
Co-operation in international business: the use of limited equity arrangements | Co-operation in international business :|the use of limited equity arrangements / |
McGraw-Hill reading. | McGraw-Hill reading. |
McGraw-Hill lectura | McGraw-Hill lectura / |
Collections: a Harcourt reading program | Collections :|a Harcourt reading program / |
Voz del lector: los libros que los ninos quieren leer Y los maestros quieren ensenar. | Voz del lector :|los libros que los ninos quieren leer Y los maestros quieren ensenar. |
Early supplementary feeding and cognition: effects over two decades | Early supplementary feeding and cognition :|effects over two decades / |
Collections: a Harcourt reading/language arts program | Collections :|a Harcourt reading/language arts program / |
Lectura Scott Foresman | Lectura Scott Foresman / |
McGraw-Hill lectura | McGraw-Hill lectura / |
McGraw-Hill lectura | McGraw-Hill lectura / |
Vamos de fiesta!: a Harcourt Spanish reading/language arts program | Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / |
Scott Foresman reading | Scott Foresman reading / |
Vamos de fiesta!: a Harcourt Spanish reading/language arts program | Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / |
Population today. | Population today. |
Practical plays | Practical plays / |
<span style=white-space:pre-wrap>Science, technology & human values. </span> | <span style=white-space:pre-wrap>Science, technology & human values. </span> |
CL locker keys | CL locker keys |
Vamos de fiesta!: a Harcourt Spanish reading/language arts program | Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / |
Putting it all together with %>%
The Pipe Operator %>%
is loaded with the tidyverse
. It takes the output of one statement and makes it the input of the next statement. You can think of it as โthenโ in natural language. So instead of making a bunch of intermediate data frames and cluttering up your workspace, you can run multiple functions at once. You can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
So in the following example, the books
tibble is first called, then the format
is filtered to include only book
, then only the title
and tot_chkout
columns are selected, and finally the data is rearranged from most to least checkouts.
myBooks <- books %>%
filter(format == "book") %>%
select(title, tot_chkout) %>%
arrange(desc(tot_chkout))
myBooks
title | tot_chkout |
---|---|
<chr> | <dbl> |
Click, clack, moo :|cows that type / | 113 |
The three pigs / | 106 |
Cook-a-doodle-doo! / | 103 |
Because of Winn-Dixie / | 79 |
Uptown / | 69 |
Through my eyes / | 63 |
Bud, not Buddy / | 63 |
Brown bear, brown bear, what do you see? / | 61 |
A year down yonder / | 47 |
Wemberly worried / | 43 |
Oso pardo, oso pardo, quรฉ ves ahรญ? / | 41 |
Discrete mathematics with applications / | 40 |
Oh say can you seed? / | 39 |
Art since 1940 :|strategies of being / | 39 |
Family / | 39 |
Calculus and analytic geometry / | 39 |
The keeping quilt / | 38 |
Miss Spider's tea party / | 38 |
There's no place like space / | 36 |
Mufaro's beautiful daughters :|an African tale / | 36 |
100th day worries / | 35 |
The yellow star :|the legend of King Christian X of Denmark / | 35 |
The tipping point :|how little things can make a big difference / | 35 |
Cindy Ellen :|a wild western Cinderella / | 34 |
Apples / | 34 |
The hero with a thousand faces / | 34 |
Where do balloons go? :|an uplifting mystery / | 34 |
Scott Foresman reading / | 34 |
Math advantage / | 34 |
Roughing it on the Oregon Trail / | 33 |
โฎ | โฎ |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 0 |
Fearon's world history / | 0 |
National survey of problems and competencies among four- to sixteen-year-olds :|parents' reports for normative and clinical samples / | 0 |
Prentice Hall middle grades math :|tools for success / | 0 |
Information literacy toolkit. | 0 |
Index to artistic biography. | 0 |
AWRA symposium on water resources and the World Wide Web, December 5-9, 1999, Seattle, Washington / | 0 |
May Sarton :|a bibliography / | 0 |
McGraw-Hill lectura / | 0 |
Glencoe matemรกticas :|aplicaciones y conexiones / | 0 |
Index to artistic biography. | 0 |
Library of Congress Africana collections :|an illustrated guide. | 0 |
McGraw-Hill lectura / | 0 |
SRA Open Court reading. / | 0 |
McGraw-Hill reading. | 0 |
SRA Open Court reading. / | 0 |
McGraw-Hill reading. | 0 |
Guided participation in cultural activity by toddlers and caregivers / | 0 |
McGraw-Hill lectura / | 0 |
Accounting handbook / | 0 |
Collections :|a Harcourt reading program / | 0 |
Macmillan/McGraw-Hill reading/language arts :|a new view. | 0 |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 0 |
Rosso in Italy :|the Italian career of Rosso Fiorentino / | 0 |
Many peoples, one land :|a guide to new multicultural literature for children and young adults / | 0 |
Conceptual revolutions / | 0 |
Co-operation in international business :|the use of limited equity arrangements / | 0 |
Collections :|a Harcourt reading program / | 0 |
Early supplementary feeding and cognition :|effects over two decades / | 0 |
Vamos de fiesta! :|a Harcourt Spanish reading/language arts program / | 0 |
Exercise: Playing with pipes %>%
- Create a new data frame booksKids with these conditions:
- filter() to include subCollection juvenile & k-12 materials and format books.
- select() only title, call number, total checkouts, and publication year
- arrange() by total checkouts in descending order
- filter() to include subCollection juvenile & k-12 materials and format books.
- Use mean() to check the average number of checkouts for the booksJuv data frame.
Split-apply-combine data analysis and the summarize()
function
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
The summarize() function
group_by()
is often used together with summarize()
, which collapses each group into a single-row summary of that group. group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.
So to compute the average checkouts by format:
books %>%
group_by(format) %>%
summarize(mean_checkouts = mean(tot_chkout))
format | mean_checkouts |
---|---|
<chr> | <dbl> |
book | 3.227123013 |
cd-rom | 0.333333333 |
database | 0.000000000 |
e-gov doc | 0.040241449 |
image | 0.027522936 |
kit/object | 1.333333333 |
map | 10.619047619 |
microform | 0.001216545 |
online video | 0.000000000 |
serial | 0.000000000 |
Books and maps have the highest, and as we would expect, databases, online videos, and serials have zero checkouts.
Here is a more complex example:
books %>%
filter(format == "book") %>%
mutate(call_class = str_sub(callnumber, 1, 1)) %>%
group_by(call_class) %>%
summarize(count = n(),
sum_tot_chkout = sum(tot_chkout)) %>%
arrange(desc(sum_tot_chkout))
call_class | count | sum_tot_chkout |
---|---|---|
<chr> | <int> | <dbl> |
E | 487 | 3114 |
NA | 459 | 3024 |
H | 1142 | 2902 |
P | 800 | 2645 |
F | 240 | 1306 |
Q | 333 | 1305 |
B | 426 | 1233 |
R | 193 | 981 |
L | 358 | 862 |
5 | 60 | 838 |
N | 289 | 703 |
D | 353 | 690 |
T | 165 | 451 |
3 | 56 | 416 |
Z | 189 | 287 |
G | 216 | 265 |
9 | 39 | 247 |
8 | 21 | 242 |
6 | 22 | 225 |
J | 228 | 173 |
K | 54 | 113 |
7 | 14 | 92 |
C | 98 | 92 |
M | 34 | 89 |
S | 251 | 46 |
U | 19 | 41 |
2 | 7 | 39 |
I | 272 | 31 |
4 | 3 | 24 |
1 | 3 | 19 |
A | 61 | 18 |
0 | 3 | 11 |
V | 5 | 6 |
Y | 83 | 5 |
Letโs break this down step by step:
- First we call the
books
data frame - We then pipe through
filter()
to include only books - We then create a new column with
mutate()
calledcall_class
by using thestr_sub()
function to keep the first character of thecall_number
variable - We then
group_by()
our newly createdcall_class
variable - We then create two summary columns by using
summarize()
- take the number
n()
of items percall_class
and assign it to a column calledcount
- take the the
sum()
oftot_chkout
percall_class
and assign the result to a column calledsum_tot_chkout
- take the number
- Finally, we arrange
sum_tot_chkout
in descending order, so we can see the class with the most total checkouts. We can see it is theE
class (History of America), followed byNA
(items with no call number data), followed byH
(Social Sciences) andP
(Language and Literature).
Exporting data
Now that you have learned how to use dplyr
to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV files into R, there is a write_csv()
function that generates CSV files from data frames.
Before using write_csv()
, we are going to create a new folder, data_output
, in our working directory that will store this generated dataset. We donโt want to write generated datasets in the same directory as our raw data. Itโs good practice to keep them separate. The data
folder should only contain the raw, unaltered data, and should be left alone to make sure we donโt delete or modify it. In contrast, our script will generate the contents of the data_output
directory, so even if the files it contains are deleted, we can always re-generate them.
In preparation for our next lesson on plotting, we are going to create a version of the dataset with most of the changes we made above. We will first read in the original, then make all the changes with pipes.
books_reformatted <- read_csv("./data/books.csv") %>%
rename(title = X245.ab,
author = X245.c,
callnumber = CALL...BIBLIO.,
isbn = ISN,
pubyear = X008.Date.One,
subCollection = BCODE1,
format = BCODE2,
location = LOCATION,
tot_chkout = TOT.CHKOUT,
loutdate = LOUTDATE,
subject = SUBJECT,
callnumber2 = CALL...ITEM.) %>%
mutate(pubyear = as.integer(pubyear),
call_class = str_sub(callnumber, 1, 1),
subCollection = recode(subCollection,
"-" = "general collection",
u = "government documents",
r = "reference",
b = "k-12 materials",
j = "juvenile",
s = "special collections",
c = "computer files",
t = "theses",
a = "archives",
z = "reserves"),
format = recode(format,
a = "book",
e = "serial",
w = "microform",
s = "e-gov doc",
o = "map",
n = "database",
k = "cd-rom",
m = "image",
"5" = "kit/object",
"4" = "online video"))
[36mโโ[39m [1m[1mColumn specification[1m[22m [36mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ[39m
cols(
CALL...BIBLIO. = [31mcol_character()[39m,
X245.ab = [31mcol_character()[39m,
X245.c = [31mcol_character()[39m,
LOCATION = [31mcol_character()[39m,
TOT.CHKOUT = [32mcol_double()[39m,
LOUTDATE = [31mcol_character()[39m,
SUBJECT = [31mcol_character()[39m,
ISN = [31mcol_character()[39m,
CALL...ITEM. = [31mcol_character()[39m,
X008.Date.One = [31mcol_character()[39m,
BCODE2 = [31mcol_character()[39m,
BCODE1 = [31mcol_character()[39m
)
Warning message in mask$eval_all_mutate(quo):
โNAs introduced by coercionโ
This chunk of code read the CSV, renamed the variables, used mutate()
in combination with recode()
to recode the format
and subCollection
values, used mutate()
in combination with as.integer()
to coerce pubyear
to integer, and used mutate()
in combination with str_sub
to create the new varable call_class
.
We now write it to a CSV and put it in the data/output
sub-directory:
write_csv(books_reformatted, "./data_output/books_reformatted.csv")
Help with dplyr
- Read more about dplyr at https://dplyr.tidyverse.org/.
- In your console, after loading library(dplyr), run vignette(โdplyrโ) to read an extremely helpful explanation of how to use it.
- See the http://r4ds.had.co.nz/transform.html in Garrett Grolemund and Hadley Wickhamโs book R for Data Science.
- Watch this Data School video: https://www.youtube.com/watch?v=jWjqLW-u3hc
Key Points
- Use the dplyr package to manipulate dataframes.
- Use select() to choose variables from a dataframe.
- Use filter() to choose data based on values. Use group_by() and summarize() to work with subsets of data. Use mutate() to create new variables.