5.2 Reading / Writing Data

Of course, if we want to work on data which is NOT included in R, we have to read that data into R first. That is, the data normally sit somewhere on your computer’s hard drive (or SSD), and you must run a command in R which reads that data into your R environment.

5.2.1 Olympic Athletes Example

Let’s look at another example, this time with a data set of Olympic athletes. This is just a subset of the full data set, to make it easier for you to work with. Here’s how we’ll read it into R:

# Read the csv file into a data frame called athletes
athletes <- read.csv("data_raw/olympic_athletes.csv")

# Print a summary of the data frame
summary(athletes)
       X              ID             Name               Sex           
 Min.   :   1   Min.   :     4   Length:5000        Length:5000       
 1st Qu.:1251   1st Qu.: 35321   Class :character   Class :character  
 Median :2500   Median : 68266   Mode  :character   Mode  :character  
 Mean   :2500   Mean   : 68668                                        
 3rd Qu.:3750   3rd Qu.:102377                                        
 Max.   :5000   Max.   :135559                                        
                                                                      
      Age            Height          Weight           Team          
 Min.   :12.00   Min.   :139.0   Min.   : 33.00   Length:5000       
 1st Qu.:21.00   1st Qu.:168.5   1st Qu.: 61.00   Class :character  
 Median :25.00   Median :175.0   Median : 70.00   Mode  :character  
 Mean   :25.65   Mean   :175.4   Mean   : 70.91                     
 3rd Qu.:28.00   3rd Qu.:183.0   3rd Qu.: 80.00                     
 Max.   :74.00   Max.   :223.0   Max.   :182.00                     
 NA's   :183     NA's   :1109    NA's   :1131                       
     NOC               Games                Year         Season         
 Length:5000        Length:5000        Min.   :1896   Length:5000       
 Class :character   Class :character   1st Qu.:1960   Class :character  
 Mode  :character   Mode  :character   Median :1988   Mode  :character  
                                       Mean   :1978                     
                                       3rd Qu.:2002                     
                                       Max.   :2016                     
                                                                        
     City              Sport              Event              Medal          
 Length:5000        Length:5000        Length:5000        Length:5000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
The above command only works because the data set is in a particular location (the data_raw folder), and is in a particular format (CSV). In the sections that follow, we’ll discuss how to address both of these issues.

5.2.2 Locating your data set

R is capable of reading data from your computer, no matter where it is, as long as you “point” R to the correct location. The location is usually specified with a file path, which is a character string that specifies the folder structure that holds your file. By default, R starts “looking” from the current working directory, and the file path used was data_raw/olympic_athletes.csv. This means that R will look inside the current working directory for a folder called data_raw, and if it exists, R will look inside data_raw for a file called olympic_athletes.csv. In this class, you should be working within an RStudio project, which automatically sets the working directory. If you created the folders as instructed earlier, then you should already have a data_raw folder in your project directory.
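To see how a relative file path is resolved, here is a minimal sketch that uses only base R. The folder demo_project and the file example.csv are made up for illustration and live in a temporary directory, not in your project. The sketch calls setwd only so that it can run anywhere; in this class, the RStudio project sets the working directory for you.

```r
# Illustration only: build a throwaway "project" inside a temporary folder
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data_raw"),
           recursive = TRUE, showWarnings = FALSE)

# Put a tiny CSV file where the relative path will point (made-up values)
writeLines(c("Name,Age", "A,23", "B,31"),
           file.path(project_dir, "data_raw", "example.csv"))

# Temporarily make the throwaway folder the working directory
old_wd <- setwd(project_dir)

# A relative path is resolved starting from the working directory
relative_path <- file.path("data_raw", "example.csv")
found <- file.exists(relative_path)
example <- read.csv(relative_path)

# Restore the original working directory
setwd(old_wd)
```

If file.exists returns FALSE for a path you expected to work, calling getwd() shows which folder R is currently starting from.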

Download the Olympic athletes data set from this link and save it in your data_raw folder. In your progress check document, simply write: “Olympic Data Downloaded”.

5.2.3 Reading CSV files

A common way of storing data in a computer file is called CSV, which stands for comma-separated values. These files are plain text, so you can open them in any text editor, such as Notepad, or even in a word processor like Word or Google Docs. Just like a data frame, these files contain data in rows and columns. On each line, the columns are separated from each other with a comma (,), which is technically called a delimiter. Programs like Excel, Google Sheets, and R can read these files and understand their row-column structure. In R, the function to read CSV files (as you saw above) is read.csv. In addition, if you call up the help file for read.csv using ?, you’ll see that there are other similar functions as well, including read.table and read.delim.
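The following sketch shows that a CSV file really is just lines of text. It writes three lines to a temporary file, then reads the same file back twice: once as raw text and once as a data frame. The names and values are invented for illustration.

```r
# Write three lines of plain text to a temporary file (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Sport,Age",
             "Athlete A,Rowing,24",
             "Athlete B,Judo,31"),
           csv_path)

# readLines returns the raw text: one character string per line
raw_lines <- readLines(csv_path)

# read.csv interprets the same text as rows and columns,
# using the first line as the column names
small <- read.csv(csv_path)
```

Printing raw_lines shows the commas exactly as they are stored in the file, while printing small shows the same content arranged as a table with two rows and three columns.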

In many object-oriented languages, the “dot” (.) is a special symbol used to access an attribute or method of an object. In R, however, variable names and function names can contain a dot, and the dot generally has no special purpose. There are some exceptions that relate to function overloading and R formulas, but these are advanced topics that will not be discussed here.

These functions are actually all variations of the same general-purpose function, read.table. The difference is that read.csv, read.delim, and the others make different assumptions about which delimiter is being used, and how decimal numbers are written (e.g. one-and-a-half may be written as 1.5 or 1,5 depending on where you live). We will discuss functions and arguments more in the next chapter, but for now, see the following table for when to use each function:

Function      Delimiter                                Decimals
read.table    whitespace by default; set with sep=…    .
read.csv      ,                                        .
read.csv2     ;                                        ,
read.delim    tab (\t)                                 .
read.delim2   tab (\t)                                 ,
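As a rough illustration of the table above, the sketch below stores the same two records in two temporary files, one using a comma delimiter with period decimals and one using a semicolon delimiter with comma decimals. The values are made up. It then reads each file with the matching function, and reads the second file again with read.table by spelling out the conventions through its sep and dec arguments.

```r
# The same two records written with two different conventions (made-up values)
comma_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Height", "A,1.5", "B,2.25"), comma_file)

semicolon_file <- tempfile(fileext = ".csv")
writeLines(c("Name;Height", "A;1,5", "B;2,25"), semicolon_file)

# read.csv expects commas between columns and a period in decimals
from_csv <- read.csv(comma_file)

# read.csv2 expects semicolons between columns and a comma in decimals
from_csv2 <- read.csv2(semicolon_file)

# read.table can read the same file once the conventions are spelled out
from_table <- read.table(semicolon_file, header = TRUE, sep = ";", dec = ",")
```

All three data frames should hold the same names and the same numeric heights. Reading the semicolon file with read.csv instead would not fail, but it would not split the columns as intended.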

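Any of these functions accepts the argument header=FALSE, which indicates that the first row of the file does not contain column names. Without this argument, read.csv, read.csv2, read.delim, and read.delim2 assume that the first row does contain column names. (read.table is the exception: it does not assume a header row by default, so you would pass header=TRUE when there is one.) If our Olympic athletes data did not contain headers in the first row, we would use this: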

athletes <- read.csv("data_raw/olympic_athletes.csv", header=FALSE)
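To make the effect of header=FALSE concrete, here is a small sketch on a temporary file with no header row. The values are invented. It also uses the col.names argument of read.csv to supply column names by hand, which is optional.

```r
# A small file with no header row (made-up values)
no_header_file <- tempfile(fileext = ".csv")
writeLines(c("A,23", "B,31"), no_header_file)

# With header = FALSE, every line is treated as data and R
# supplies the default column names V1, V2, ...
correct <- read.csv(no_header_file, header = FALSE)

# Without it, read.csv uses the first line as column names,
# so one row of data is lost
wrong <- read.csv(no_header_file)

# Optionally, supply meaningful column names yourself
named <- read.csv(no_header_file, header = FALSE,
                  col.names = c("Name", "Age"))
```

Comparing nrow(correct) with nrow(wrong) is a quick way to check whether a row of data was mistaken for a header.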

5.2.4 Writing CSV files

R can also write a data frame to a CSV file on your computer, which can then be read by Excel, Sheets, etc. Let’s suppose we wanted to save a version of the athletes data with only the Sex and Age columns. We could use the write.csv function:

# Make a new data frame with only the Sex and Age columns
athletes2 <- athletes[,c("Sex", "Age")]

# Save the new data frame as a CSV in the clean data folder
write.csv(athletes2, "data_clean/olympic_athletes_age_sex.csv")

Notice we created a new data frame by selecting only the desired columns. We will talk more about different ways to select data when we discuss indexing. Notice also that the write.csv function requires that we give it the name of the data frame being saved (athletes2), then the file path for the csv file that the data will be written to ("data_clean/olympic_athletes_age_sex.csv").
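One detail worth knowing: by default, write.csv also writes the data frame’s row names as an extra first column. The sketch below uses a tiny made-up data frame (demo, a stand-in for athletes2) and temporary files to compare the default with the row.names = FALSE argument of write.csv. A column named X, like the one in the summary earlier, can arise when a file written with row names is read back in.

```r
# A tiny stand-in for athletes2 (made-up values)
demo <- data.frame(Sex = c("F", "M"), Age = c(24, 31))

# Default behaviour: row names are written as an extra first column
with_rows <- tempfile(fileext = ".csv")
write.csv(demo, with_rows)
back_with_rows <- read.csv(with_rows)

# row.names = FALSE writes only the data frame's own columns
without_rows <- tempfile(fileext = ".csv")
write.csv(demo, without_rows, row.names = FALSE)
back_without_rows <- read.csv(without_rows)
```

Here names(back_with_rows) includes an extra X column in front of Sex and Age, while names(back_without_rows) contains only Sex and Age.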

write.csv is an example of a function which takes multiple arguments, separated from each other with a comma (,). When arguments are given without names, as they are here, they must be specified in order, but more will be said about this later.

Create a version of the athletes data frame which contains the athletes’ names and their sports. Save the new data frame as a CSV file in your data_clean folder with the file name “olympic_athletes_name_sport.csv”. Include the code in your progress check assignment.

The read and write terminology may seem odd if you have not heard those terms before. Your computer’s hard drive (or SSD) stores data which will be remembered even after you turn off your computer. The process of getting data from your hard drive (or SSD) is called reading, and the process of putting data on it is called writing.