5.2 Reading / Writing Data
Of course, if we want to work on data that is NOT included in R, we first have to read that data into R. That is, the data are normally stored somewhere on your computer’s hard drive (or SSD), and you must run a command in R which reads that data into your R environment.
5.2.1 Olympic Athletes Example
Let’s look at another example, this time with a data set of Olympic athletes. This is just a subset of the full data set, to make it easier for you to work with. Here’s how we’ll read it into R:
# Read the csv file into a data frame called athletes
athletes <- read.csv("data_raw/olympic_athletes.csv")
# Print a summary of the data frame
summary(athletes)
X ID Name Sex
Min. : 1 Min. : 4 Length:5000 Length:5000
1st Qu.:1251 1st Qu.: 35321 Class :character Class :character
Median :2500 Median : 68266 Mode :character Mode :character
Mean :2500 Mean : 68668
3rd Qu.:3750 3rd Qu.:102377
Max. :5000 Max. :135559
Age Height Weight Team
Min. :12.00 Min. :139.0 Min. : 33.00 Length:5000
1st Qu.:21.00 1st Qu.:168.5 1st Qu.: 61.00 Class :character
Median :25.00 Median :175.0 Median : 70.00 Mode :character
Mean :25.65 Mean :175.4 Mean : 70.91
3rd Qu.:28.00 3rd Qu.:183.0 3rd Qu.: 80.00
Max. :74.00 Max. :223.0 Max. :182.00
NA's :183 NA's :1109 NA's :1131
NOC Games Year Season
Length:5000 Length:5000 Min. :1896 Length:5000
Class :character Class :character 1st Qu.:1960 Class :character
Mode :character Mode :character Median :1988 Mode :character
Mean :1978
3rd Qu.:2002
Max. :2016
City Sport Event Medal
Length:5000 Length:5000 Length:5000 Length:5000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
5.2.2 Locating your data set
R is capable of reading data from your computer, no matter where it is, as long as you “point” R to the correct location.
The location is usually specified with a file path, which is a character string that specifies the folder structure that holds your file.
By default, R starts “looking” from the current working directory, and the file path used was data_raw/olympic_athletes.csv.
This means that R will look inside the current working directory for a folder called data_raw, and if it exists, R will look inside data_raw for a file called olympic_athletes.csv.
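As a minimal sketch of how this relative path is resolved, the following builds the full location R would look in and checks whether the file is there. The variable names (wd, rel_path, full_path) are illustrative only, and the printed results depend on your own computer.

```r
# Show the current working directory (where R starts "looking")
wd <- getwd()
print(wd)

# The relative path used in the example above
rel_path <- file.path("data_raw", "olympic_athletes.csv")

# Combine the two to see the full location R will try to read
full_path <- file.path(wd, rel_path)
print(full_path)

# TRUE if the file is where R expects it, FALSE otherwise
found <- file.exists(rel_path)
print(found)
```

If the last line prints FALSE, either the working directory is not your project folder or the file has not yet been saved in data_raw.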
In this class, you should be working within an RStudio project, which automatically sets the working directory.
If you created the folders as instructed earlier, then you should already have a data_raw folder in your project directory.
Download the Olympic athletes data file (olympic_athletes.csv) and save it in your data_raw folder.
In your progress check document, simply write: “Olympic Data Downloaded”.
5.2.3 Reading CSV files
A common way of storing data in a computer file is called CSV, which stands for comma-separated values.
These files are plain text, so you can open them in any program that reads text, such as Notepad, Word, or even Google Docs.
Just like a data frame, these files contain data in rows and columns, where on each line, the columns are separated from each other with a comma (,), which is technically called a delimiter.
Programs like Excel, Google Sheets, and R can read these files and understand their row-column structure.
In R, the function to read CSV files (as you saw above) is read.csv.
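To make the row-column structure concrete, here is a small sketch that parses a few lines of CSV text directly. The names and values are made up for illustration, and the text argument is used only so that the example runs without a file on disk.

```r
# A tiny CSV written out as plain text (made-up values, for illustration only)
csv_text <- "name,sport,age
Ana,Rowing,24
Ben,Judo,31"

# read.csv normally takes a file path; the text argument lets us
# parse a character string instead, which is handy for small demos
demo <- read.csv(text = csv_text)

# The first line became the column names, the other lines became rows
print(demo)
print(dim(demo))  # number of rows, then number of columns
```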
In addition, if you call up the help file for read.csv using ?, you’ll see that there are other similar functions as well, including read.table and read.delim.
In many object-oriented languages, the “dot” (.) is a
special symbol used to access an attribute or method of an object. In R,
however, variable names and function names can contain a dot, and the
dot has no special purpose. There are some exceptions, however, that
relate to function overloading, and R formulas, but these are advanced
topics that will not be discussed here.
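A short sketch of this point, using made-up names (my.value, my_value, half.of) that are not part of any package:

```r
# A dot is just another character in a name
my.value <- 10
my_value <- 10

# Both are ordinary variables; the dot does not access anything
total <- my.value + my_value
print(total)

# The same goes for function names (half.of is a made-up name)
half.of <- function(x) x / 2
print(half.of(my.value))
```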
The read.csv, read.delim, and related functions are actually all variations of the same generic function, read.table.
The difference is that read.csv, read.delim, and the others make different assumptions about what delimiters are being used, and how decimal numbers are displayed (e.g. one-and-a-half may be written as 1.5 or 1,5 depending on where you live).
We will discuss functions and arguments more in the next chapter, but for now, see the following table for when to use each function:
Function | Delimiter | Decimals |
---|---|---|
read.table | Must specify with sep=… | . |
read.csv | , | . |
read.csv2 | ; | , |
read.delim | \t (tab) | . |
read.delim2 | \t (tab) | , |
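The table can be illustrated with a small sketch in which the same two rows are stored under two conventions. The data are made up, and the text argument is used only so the example runs without any files.

```r
# The same two rows, stored with two different conventions (made-up values)
us_style <- "name,height\nAna,1.5\nBen,1.8"
eu_style <- "name;height\nAna;1,5\nBen;1,8"

# Comma delimiter, "." for decimals
df_us <- read.csv(text = us_style)

# Semicolon delimiter, "," for decimals
df_eu <- read.csv2(text = eu_style)

# read.table can do the same job if told the delimiter and decimal mark
df_tbl <- read.table(text = eu_style, sep = ";", dec = ",", header = TRUE)

# All three versions hold the same numeric heights
print(df_us$height)
print(df_eu$height)
print(df_tbl$height)
```

Reading the semicolon-separated text with read.csv instead would not split the columns as intended, which is why picking the function that matches the file matters.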
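Any of these functions accepts the argument header=FALSE, which indicates that the first row of the file does not contain column names.
Without this argument, read.csv, read.delim, and their variants assume the first row does contain column names (read.table does not make that assumption by default).
If our Olympic athletes data did not contain headers in the first row, we would add header=FALSE to the call: read.csv("data_raw/olympic_athletes.csv", header=FALSE).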
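To see the effect of header=FALSE without needing the athletes file, the sketch below writes a small header-less file to a temporary location and reads it back. The values and the column names passed to col.names are made up for illustration.

```r
# A small file with NO header row (made-up values, written to a temp file)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Ana,Rowing,24", "Ben,Judo,31"), tmp)

# header = FALSE: the first line is treated as data, and R
# supplies default column names V1, V2, V3
no_header <- read.csv(tmp, header = FALSE)
print(no_header)

# Optionally supply your own column names with col.names
named <- read.csv(tmp, header = FALSE,
                  col.names = c("name", "sport", "age"))
print(names(named))

unlink(tmp)  # remove the temporary file
```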
5.2.4 Writing CSV files
R can also write a data frame to a CSV file on your computer, which can then be read by Excel, Google Sheets, etc.
Let’s suppose we wanted to save a version of the athletes data with only the Sex and Age columns.
We could use the write.csv function:
# Make a new data frame with only the Sex and Age columns
athletes2 <- athletes[,c("Sex", "Age")]
# Save the new data frame as a CSV in the clean data folder
write.csv(athletes2, "data_clean/olympic_athletes_age_sex.csv")
Notice we created a new data frame by selecting only the desired columns.
We will talk more about different ways to select data when we discuss indexing.
Notice also that the write.csv function requires that we give it the name of the data frame being saved (athletes2), then the file path for the csv file that the data will be written to ("data_clean/olympic_athletes_age_sex.csv").
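One detail worth knowing: by default, write.csv also writes the data frame’s row names as an unnamed first column, which read.csv then reads back as a column called X (this is likely where the X column in the athletes summary came from). A minimal sketch, using a small made-up data frame and temporary files rather than the athletes data:

```r
# A small made-up data frame standing in for athletes2
small <- data.frame(Sex = c("F", "M"), Age = c(24, 31))

# Temporary files so the example does not touch your project folders
with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(small, with_rn)                        # default: row names included
write.csv(small, without_rn, row.names = FALSE)  # row names left out

# Read both files back and compare the column names
cols_with <- names(read.csv(with_rn))
cols_without <- names(read.csv(without_rn))
print(cols_with)
print(cols_without)

unlink(c(with_rn, without_rn))  # remove the temporary files
```

Whether to keep the row names is a judgment call; leaving them out with row.names = FALSE avoids an extra column when the file is read again.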
write.csv is an example of a function which takes
multiple arguments, separating them with a comma (,).
Usually, these arguments must be specified in order,
but more will be said about this later.
Create a version of the athletes data frame which contains the
athletes’ names and their sports. Save the new data frame as a CSV file
in your data_clean folder with the file name
“olympic_athletes_name_sport.csv”. Include the code in your progress
check assignment.
The read and write terminology may seem odd
if you have not heard those terms before. Your computer’s hard drive (or
SSD) stores data that is remembered even after you turn off
your computer. The process of getting data from, and putting data on,
your hard drive (or SSD) is called reading and writing.