5.4 Data Formatting

Before we continue working with data, here are a few comments about data formatting. Many data sets can be thought of as one or more observations of one or more variables. R functions work best when the data are formatted into rows and columns, so that each row is an observation, and each column is a variable. Unfortunately, sometimes data do not follow this convention, or worse, it may not be clear what the observations or variables are. Working with data often involves answering multiple questions, and different questions may require thinking of observations and variables differently. In R, there are ways of changing the structure of data to suit your particular needs, but these are intermediate topics which will not be discussed here.
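As a minimal sketch of the row-and-column convention described above, the data frame below stores one observation per row and one variable per column. The object name `plants`, its column names, and its values are made up purely for illustration.

```r
# Hypothetical data: three observations (rows) of two variables (columns)
plants <- data.frame(
  height_cm  = c(10.2, 12.5, 9.8),
  leaf_count = c(4, 6, 3)
)

nrow(plants)             # number of observations
ncol(plants)             # number of variables
mean(plants$height_cm)   # summarize a single variable
colMeans(plants)         # summarize every variable at once
```

Because each column is a variable, functions such as `mean()` and `colMeans()` can be applied directly without rearranging anything first.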

One concept related to data organization is called “Tidy Data”, which you can read more about here. This focus on tidiness has led to the development of a set of R packages collectively called the “tidyverse”, which has become very popular for data analysis in R. The tidyverse will not be covered in this class, but a later module will provide extensive experience with it.

5.4.1 “Raw” data vs. “Clean” data.

Why is there a “data_raw” folder and a “data_clean” folder, and not just a “data” folder? The idea is that the “data_raw” folder contains all of the original data sets that you start with, before any cleaning or summarization takes place. Any cleaned, modified, or created data sets that result from your data analysis should be stored in the “data_clean” folder (or possibly even a “results” folder). This distinction ensures that the original data sets are preserved in their unedited state, in case you need to start over from the beginning to answer a different question, and so that others can easily replicate your work. This is why the data in the “data_raw” folder should be thought of as read-only. Sometimes people even modify the permissions of the raw data files on their computer to prevent anyone from accidentally deleting or overwriting the raw data. The “clean” moniker comes from the fact that data sets often need some “cleaning”, such as removing duplicates, removing NA values, or discarding irrelevant data. There are many other ways of organizing data, but the principle here is to separate the original data sets from any intermediate data sets.
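The raw/clean workflow might look something like the sketch below. It builds a throwaway project inside `tempdir()` so that it can run anywhere. The file names (`measurements.csv`, `measurements_clean.csv`) and the data values are hypothetical, and in a real project the folders would already exist in your project directory.

```r
# Set up a throwaway project folder (assumption: stands in for your real project)
project <- file.path(tempdir(), "example_project")
dir.create(file.path(project, "data_raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data_clean"), showWarnings = FALSE)

# Hypothetical raw file containing one missing value and one duplicated row
raw_path <- file.path(project, "data_raw", "measurements.csv")
if (!file.exists(raw_path)) {
  writeLines(c("id,value", "1,3.2", "2,NA", "3,4.1", "3,4.1"), raw_path)
}

# Optionally mark the raw file as read-only (the effect can vary by operating system)
Sys.chmod(raw_path, mode = "0444")

# Read the raw data, clean it, and never write back to data_raw
raw   <- read.csv(raw_path)
clean <- unique(raw)      # remove duplicate rows
clean <- na.omit(clean)   # drop rows that contain NA

# Store the cleaned result in data_clean
clean_path <- file.path(project, "data_clean", "measurements_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

The raw file is only ever read, so the cleaning steps can be rerun from scratch at any time.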

Perhaps you’ve never thought about how data should be structured. Consider an experiment which measures the temperatures of five guinea pigs on each of four different days. Think about organizing each row to be a guinea pig and each column to be a day. Can you think of an R function to compute the average temperature on day 1? How about the average temperature for guinea pig 3? How do your answers change if the data are arranged with days as the rows and guinea pigs as columns? Can you think of another way to organize the data?
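Once you have tried these questions yourself, the sketch below shows one possible way to explore them. The temperatures are simulated, not real measurements, and the object names (`temps`, `temps_t`, `temps_long`) are chosen only for illustration. Other answers are equally valid.

```r
# Simulated temperatures (degrees C): 5 guinea pigs (rows) by 4 days (columns)
set.seed(1)
temps <- matrix(
  round(rnorm(20, mean = 39, sd = 0.3), 1),
  nrow = 5, ncol = 4,
  dimnames = list(paste0("pig", 1:5), paste0("day", 1:4))
)

mean(temps[, "day1"])   # average temperature on day 1 (one column)
mean(temps["pig3", ])   # average temperature for guinea pig 3 (one row)
colMeans(temps)         # averages for every day
rowMeans(temps)         # averages for every guinea pig

# Days as rows, guinea pigs as columns: rows and columns swap roles
temps_t <- t(temps)
mean(temps_t["day1", ])   # day 1 is now a row
mean(temps_t[, "pig3"])   # guinea pig 3 is now a column

# Another layout: one row per measurement, with pig and day as variables
temps_long <- data.frame(
  pig  = rep(rownames(temps), times = ncol(temps)),
  day  = rep(colnames(temps), each = nrow(temps)),
  temp = as.vector(temps)
)
aggregate(temp ~ day, data = temps_long, FUN = mean)   # average per day
```

In the last layout every row is a single temperature reading, so the same `aggregate()` call can summarize by day or by guinea pig simply by changing the formula.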

This is the last section you should include in Progress Check 3. Knit your output document and submit it on Canvas.