
Starting with Data

Questions

What is a data.frame?
How can I read a complete csv file into R?
How can I get basic summary information about my dataset?
How can I change the way R treats strings in my dataset?
Why would I want strings to be treated differently?
How are dates represented in R and how can I change the format?

Objectives

Describe what a data frame is.
Load external data from a .csv file into a data frame.
Summarize the contents of a data frame.
Describe the difference between a factor and a string.
Convert between strings and factors.
Reorder and rename factors.
Change how character strings are handled in a data frame.
Examine and change date formats.

We are going to skip a few steps for now. Because we are not in the RStudio environment, things are a little easier, but rest assured that the instructions for RStudio are available in a separate notebook in this folder.

R has some base functions for reading a local data file into your R session, namely read.table() and read.csv(), but these have some idiosyncrasies that were improved upon in the readr package, which is installed and loaded with tidyverse.

library(tidyverse)

To get our sample data into our R session, we will use the read_csv() function and assign the result to an object named books.

books <- read_csv("./data/books.csv")

You will see the message Parsed with column specification, followed by each column name and its data type. When you execute read_csv on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R. For example, in this dataset, it reads SUBJECT as col_character (character), and TOT.CHKOUT as col_double. You have the option to specify the data type for a column manually by using the col_types argument in read_csv.
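As a minimal sketch of the col_types argument mentioned above, the example below reads a small made-up CSV from a string instead of ./data/books.csv. The three rows and the object name sample_books are invented for illustration; only the column names SUBJECT, TOT.CHKOUT, and ISN come from this lesson. It assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Hypothetical three-row sample; the values are made up for illustration.
csv_text <- "SUBJECT,TOT.CHKOUT,ISN\nHistory,12,0123456789\nScience,3,0987654321\nArt,0,0000000042\n"

# Declare the column types up front instead of relying on guessing.
sample_books <- read_csv(
  I(csv_text),
  col_types = cols(
    SUBJECT    = col_character(),
    TOT.CHKOUT = col_integer(),    # whole numbers rather than the guessed double
    ISN        = col_character()   # keep identifiers as text so leading zeros survive
  )
)

# Show the column specification that was actually used.
spec(sample_books)
```

Reading identifier-like columns as character is a common choice, because a numeric type would silently drop leading zeros.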

You should now have an R object called books in the Environment pane: 10000 observations of 12 variables. We will be using this data file in the next module.

NOTE: read_csv() assumes that fields are delimited by commas. In several countries, however, the comma is used as the decimal separator and the semicolon (;) as the field delimiter. To read this type of file into R, use the read_csv2() function. It behaves like read_csv() but uses different defaults for the decimal and field separators. If you are working with another format, both can be specified by the user. Check out the help for read_csv() by typing ?read_csv to learn more. There is also read_tsv() for tab-separated data files, and read_delim() lets you specify more details about the structure of your file.
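The sketch below illustrates the semicolon case on a tiny made-up table (the titles and prices are invented, and the object names are hypothetical). It reads the same text once with read_csv2() and once with read_delim(), where the separators are stated explicitly. It assumes readr 2.0 or later for I() and show_col_types.

```r
library(readr)

# Made-up data: semicolon-separated fields, comma as the decimal mark.
semicolon_text <- "title;price\nBook A;12,50\nBook B;7,25\n"

# read_csv2() uses ";" as the field separator and "," as the decimal mark.
prices <- read_csv2(I(semicolon_text), show_col_types = FALSE)

# The same text read with read_delim(), stating both separators explicitly.
prices_delim <- read_delim(
  I(semicolon_text),
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

prices$price
prices_delim$price
```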

What are data frames and tibbles?

Data frames are the de facto data structure for tabular data in R, and what we use for data processing, statistics, and plotting.

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

Data frames can be created by hand, but most commonly they are generated by the functions read_csv() or read_table(); in other words, when importing spreadsheets from your hard drive (or the web).
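To make the "columns are equal-length vectors" idea concrete, here is a small sketch that builds a data frame by hand with base R. The object name toy and its values are made up for illustration.

```r
# Three vectors of equal length, one per column.
toy <- data.frame(
  count    = c(1.5, 2, 3.25),       # numeric
  label    = c("a", "b", "c"),      # character
  in_stock = c(TRUE, FALSE, TRUE),  # logical
  stringsAsFactors = FALSE          # keep strings as character, not factor
)

# One line per column, showing its type and first values.
str(toy)
```

Trying to combine vectors of different lengths that cannot be recycled evenly (for example 3 and 2 elements) makes data.frame() stop with an error, which is the "same length" rule in action.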

A tibble is an extension of R data frames used by the tidyverse. When the data is read using read_csv(), it is stored in an object of class tbl_df, tbl, and data.frame. You can see the class of an object with class().
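A short sketch of class() on a plain data frame and on a tibble, using a made-up one-column object. It assumes the tibble package is installed (it comes with the tidyverse).

```r
library(tibble)

df <- data.frame(x = 1:3)   # plain base R data frame
tb <- as_tibble(df)         # the same data as a tibble

class(df)  # "data.frame"
class(tb)  # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, so data frame functions keep working.
inherits(tb, "data.frame")  # TRUE
```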

Inspecting data frames

When calling a tbl_df object (like books here), a lot of information about our data frame is already displayed, such as the number of rows, the number of columns, the names of the columns, and, as we just saw, the type of data stored in each column. However, there are also functions to extract this information from data frames. Here is a non-exhaustive list of some of these functions. Let’s try them out!

Size:
- dim(books) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow(books) - returns the number of rows
- ncol(books) - returns the number of columns
Content:
- head(books) - shows the first 6 rows
- tail(books) - shows the last 6 rows
Names:
- names(books) - returns the column names (synonym of colnames() for data.frame objects)
Summary:
- View(books) - look at the data in the viewer
- str(books) - structure of the object and information about the class, length and content of each column
- summary(books) - summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data frames.
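If you do not have books loaded, the inspection functions above can be tried on a small stand-in. The data frame toy_books below is made up for illustration and uses only base R.

```r
# A made-up eight-row data frame standing in for books.
toy_books <- data.frame(
  title    = c("A", "B", "C", "D", "E", "F", "G", "H"),
  checkout = c(0, 3, 12, 1, 0, 45, 7, 2),
  stringsAsFactors = FALSE
)

dim(toy_books)       # 8 2  (rows, then columns)
nrow(toy_books)      # 8
ncol(toy_books)      # 2
head(toy_books)      # first 6 rows by default
tail(toy_books, 3)   # the second argument changes how many rows are shown
names(toy_books)     # "title" "checkout"
str(toy_books)       # type and first values of each column
summary(toy_books)   # summary statistics per column
```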

The map() function from purrr is a useful way of running a function on all variables in a data frame or list. If you loaded the tidyverse at the beginning of the session, you also loaded purrr. Here we call class() on books using map_chr(), which will return a character vector of the classes for each variable.

map_chr(books, class)
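The same call on a small made-up data frame shows what comes back: a named character vector with one class per column. The base R vapply() line is included only as a comparison and is not part of the lesson's workflow. Note that map_chr() expects exactly one string per column, so it would fail on a column whose class has more than one element.

```r
library(purrr)

# Made-up data frame with one column of each basic type.
toy <- data.frame(
  n    = c(1.5, 2),
  s    = c("a", "b"),
  flag = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector.
col_classes <- map_chr(toy, class)
col_classes

# Base R comparison: vapply() also insists on one string per column.
base_classes <- vapply(toy, class, character(1))
```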

Indexing and subsetting data frames

Our books data frame has 2 dimensions: rows (observations) and columns (variables). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. In the last session, we used square brackets [ ] to subset values from vectors. Here we will do the same thing for data frames, but we can now add a second dimension. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

## first element in the first column of the data frame (as a 1 x 1 tibble)
books[1, 1]
## first element in the 6th column (as a 1 x 1 tibble)
books[1, 6]
## first column of the data frame (as a vector)
books[[1]]
## first column of the data frame (as a one-column tibble)
books[1]
## first three elements in the 7th column (as a tibble; use books[[7]][1:3] for a vector)
books[1:3, 7]
## the 3rd row of the data frame (as a tibble)
books[3, ]
## equivalent to head_books <- head(books)
head_books <- books[1:6, ]
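The class of the result depends on both the brackets used and on whether the object is a tibble or a plain data frame. The sketch below shows the difference on made-up objects toy_df and toy_tb (hypothetical names), assuming the tibble package is installed.

```r
library(tibble)

toy_df <- data.frame(
  id    = 1:4,
  title = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)
toy_tb <- as_tibble(toy_df)

# Single cell: a plain data frame drops to a vector, a tibble stays a tibble.
class(toy_df[1, 1])   # "integer"
class(toy_tb[1, 1])   # "tbl_df" "tbl" "data.frame"

toy_tb[[1]]     # first column as a vector: 1 2 3 4
toy_tb[1]       # first column as a one-column tibble
toy_tb[2:3, ]   # rows 2 and 3, all columns
toy_tb[, -1]    # every column except the first
```

Use double brackets (or $) whenever a function needs a plain vector rather than a one-column table.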

Dollar sign

The dollar sign $ is used to extract a specific variable (column, in Excel-speak) from a data frame as a vector:

head(books$X245.ab)  # print the first six book titles

# print the mean number of checkouts
mean(books$TOT.CHKOUT)
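A small sketch of $ on made-up data, including what happens to mean() when a column contains a missing value. The names toy_books and with_na are hypothetical.

```r
toy_books <- data.frame(
  title     = c("A", "B", "C"),
  checkouts = c(2, 10, 0),
  stringsAsFactors = FALSE
)

toy_books$title            # "A" "B" "C"
mean(toy_books$checkouts)  # 4
toy_books[["checkouts"]]   # the same column, selected by name with double brackets

# A single NA makes the mean NA unless it is removed first.
with_na <- c(2, 10, NA)
mean(with_na)                # NA
mean(with_na, na.rm = TRUE)  # 6
```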

unique(), table(), and duplicated()

unique()

to see all the distinct values in a variable:

unique(books$BCODE2)

table()

to get quick frequency counts on a variable:

table(books$BCODE2)  # frequency counts on a variable

You can combine table() with relational operators:

table(books$TOT.CHKOUT > 50)  # how many books have more than 50 checkouts?
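The difference between > and >= matters at the boundary. The sketch below uses made-up vectors (codes and checkouts are hypothetical names) to show unique(), table(), and a table of a logical test.

```r
codes <- c("a", "j", "a", "a", "j", "-")

unique(codes)   # "a" "j" "-"  (in order of first appearance)
table(codes)    # counts per distinct value

checkouts <- c(0, 50, 51, 120, 3)

table(checkouts > 50)    # FALSE: 3, TRUE: 2  (50 itself is not counted)
table(checkouts >= 50)   # FALSE: 2, TRUE: 3  (50 is counted)

# sum() on a logical vector counts the TRUE values directly.
sum(checkouts > 50)      # 2
```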

duplicated()

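will give you a logical vector that is TRUE for every value that has already appeared earlier in the vector. The first occurrence of a value is not marked as a duplicate.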

duplicated(books$ISN)  # a TRUE/FALSE vector of duplicated values in the ISN column
!duplicated(books$ISN)  # you can put an exclamation mark before it to get non-duplicated values
table(duplicated(books$ISN))  # run a table of duplicated values
which(duplicated(books$ISN))  # get row numbers of duplicated values
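On a short made-up vector it is easier to see which positions get flagged. The vector isn below is invented for illustration; the fromLast argument is a standard option of base R's duplicated().

```r
isn <- c("111", "222", "111", "333", "222", "111")

duplicated(isn)          # FALSE FALSE TRUE FALSE TRUE TRUE
which(duplicated(isn))   # 3 5 6
isn[!duplicated(isn)]    # "111" "222" "333"  (first occurrence of each value)

# To flag every copy, including the first, also scan from the end.
all_copies <- duplicated(isn) | duplicated(isn, fromLast = TRUE)
all_copies               # TRUE TRUE TRUE FALSE TRUE TRUE
```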

Exploring missing values

You may also need to know the number of missing values:

sum(is.na(books))  # How many total missing values?

colSums(is.na(books))  # Total missing values per column

table(is.na(books$ISN))  # use table() and is.na() in combination

booksNoNA <- na.omit(books)  # Return only observations that have no missing values
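The sketch below runs the same calls on a four-row made-up data frame (toy is a hypothetical name) so the counts can be checked by eye. Note how na.omit() drops a row if any of its columns is missing.

```r
toy <- data.frame(
  isn       = c("111", NA, "333", NA),
  checkouts = c(5, 2, NA, 0),
  stringsAsFactors = FALSE
)

sum(is.na(toy))         # 3 missing values in total
colSums(is.na(toy))     # isn: 2, checkouts: 1
table(is.na(toy$isn))   # FALSE: 2, TRUE: 2

# Only the first row has no missing value in any column.
complete <- na.omit(toy)
nrow(complete)          # 1
```

Because na.omit() can discard many rows, it is worth comparing nrow() before and after calling it.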

Exercise 3.1

Call View(books) to examine the data frame. Use the small arrow buttons next to the variable name to sort TOT.CHKOUT from highest to lowest. What item has the most checkouts?
What is the class of the TOT.CHKOUT variable?
Use table() and is.na() to find out how many NA values are in the ISN variable.
Call summary(books$TOT.CHKOUT). What can we infer when we compare the mean, median, and max?
hist() will print a rudimentary histogram, which displays frequency counts. Call hist(books$TOT.CHKOUT). What is this telling us?

#Exercise 3.1

Logical tests

R contains a number of operators you can use to compare values. Use help(Comparison) to read the R help file. Note that two equal signs (==) are used for evaluating equality (because one equals sign (=) is used for assigning variables).

Operator	Function
<	Less Than
>	Greater Than
==	Equal To
<=	Less Than or Equal To
>=	Greater Than or Equal To
!=	Not Equal To
%in%	Has a Match In
is.na()	Is NA
!is.na()	Is Not NA
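The comparison operators in the table work element by element on vectors. A brief sketch on a made-up vector (checkouts is a hypothetical name) shows the results, including how NA behaves:

```r
checkouts <- c(0, 12, NA, 50)

checkouts > 10    # FALSE TRUE NA TRUE
checkouts == 50   # FALSE FALSE NA TRUE
checkouts != 0    # FALSE TRUE NA TRUE

# %in% tests each left-hand value for membership in the right-hand vector.
c("a", "z") %in% c("a", "b")   # TRUE FALSE

# Comparing with NA gives NA, so use is.na() to test for missing values.
checkouts == NA      # NA NA NA NA
is.na(checkouts)     # FALSE FALSE TRUE FALSE
!is.na(checkouts)    # TRUE TRUE FALSE TRUE
```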

Sometimes you need to do multiple logical tests (think Boolean logic). Use help(Logic) to read the help file.

Operator	Function
&	boolean AND
|	boolean OR
!	boolean NOT
any()	Are some values true?
all()	Are all values true?
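A minimal sketch of combining tests with the logical operators, on a made-up vector x:

```r
x <- c(0, 12, 50, 120)

x > 10 & x < 100   # FALSE TRUE TRUE FALSE   (both conditions must hold)
x < 10 | x > 100   # TRUE FALSE FALSE TRUE   (either condition may hold)
!(x > 10)          # TRUE FALSE FALSE FALSE  (negation)

any(x > 100)       # TRUE: at least one value is above 100
all(x > 10)        # FALSE: 0 is not above 10
```

The single & and | shown here work element by element, which is what you want when filtering the rows of a data frame.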

Key Points

- Use read_csv() to read tabular data into R.
- Use factors to represent categorical data in R.