5.5 Indexing
Part of doing interesting things with data is being able to select just the data that you need for a particular circumstance.
You’ve already seen how to get a particular element from a vector or matrix, or a specific component from a list, using indices.
A datum’s index is its position in the vector or list.
For example, to get the second element of a vector A, we use the index 2 in square brackets: A[2].
The process of selecting elements using their indices is called indexing, and R provides multiple ways of indexing vectors.
Below we’ll cover some basic indexing and more advanced indexing for the different data structures in R.
Download the progress check 4 template and follow the instructions. That document should include all progress reports from Section 5.5 through (and including) Section 6.1.
5.5.1 Vectors
Let’s define a vector and access an element in the way you already know:
# Create an example vector
V <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
# Access the 5th element
V[5]
[1] "E"
Unlike many other languages, R indices start at 1, not 0, so the first element of V is accessed as V[1], the second as V[2], and so on.
Here are some other ways you can index as well. You can access multiple indices at the same time using a numeric vector of indices:
[1] "A" "B" "E"
If you need to access several indices in a row, you can use a colon (:):
[1] "B" "C" "D" "E" "F" "G"
You can even combine these two methods:
[1] "A" "B" "C" "F"
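The calls behind the three outputs above are not shown in this copy. As a sketch, assuming V is still the nine-letter vector defined at the start of this section, the following calls would reproduce them (the variable names picked, run, and mixed are only for illustration):

```r
# Recreate the example vector from the start of this section
V <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")

# A numeric vector of indices selects several elements at once
picked <- V[c(1, 2, 5)]
print(picked)  # "A" "B" "E"

# A colon builds a run of consecutive indices
run <- V[2:7]
print(run)     # "B" "C" "D" "E" "F" "G"

# Both approaches can be combined inside c()
mixed <- V[c(1:3, 6)]
print(mixed)   # "A" "B" "C" "F"
```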
Note that the following are all equivalent ways to access the first three elements of V:
V[1:3]
V[c(1,2,3)]
V[c(1:3)]
V[c(1:2,3)]
- can you think of another example?
The first form is probably the clearest for someone else reading your code. All of these methods work with assignment as well:
[1] "X" "B" "C" "D" "E" "F" "X" "X" "X"
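The assignment that produced this output is not shown here. One assignment that would give the same result, assuming V starts as the original nine-letter vector, is sketched below:

```r
# Start from the original example vector
V <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")

# Overwrite element 1 and elements 7 through 9 in a single assignment;
# the single value "X" is reused for every selected position
V[c(1, 7:9)] <- "X"
print(V)  # "X" "B" "C" "D" "E" "F" "X" "X" "X"
```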
Even though these examples use a character vector, this indexing works on vectors of any type.
5.5.2 Matrices
To access an element of a matrix, we have to specify the row and the column.
Let’s create a matrix from the V vector and access one of its elements:
[,1] [,2] [,3]
[1,] "X" "D" "X"
[2,] "B" "E" "X"
[3,] "C" "F" "X"
[1] "D"
Recall that we can access an entire row or column by leaving the other index blank:
[1] "X" "D" "X"
[1] "D" "E" "F"
Any of the indexing methods we just used for vectors can also be used on the rows and columns of a matrix:
[,1] [,2]
[1,] "D" "X"
[2,] "E" "X"
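The matrix outputs in this subsection appear without the code that created them. The sketch below is one way to reproduce them, assuming the matrix was built with matrix() from the modified V (the name M and the 3-by-3 shape are inferred from the printed output):

```r
# V as it looks after the assignment example in the Vectors subsection
V <- c("X", "B", "C", "D", "E", "F", "X", "X", "X")

# matrix() fills column by column: "X","B","C" become column 1, and so on
M <- matrix(V, nrow = 3, ncol = 3)

elem <- M[1, 2]      # row 1, column 2
row1 <- M[1, ]       # all of row 1 (column index left blank)
col2 <- M[, 2]       # all of column 2 (row index left blank)
sub  <- M[1:2, 2:3]  # rows 1-2 and columns 2-3, using vector-style ranges

print(elem)  # "D"
print(row1)  # "X" "D" "X"
print(col2)  # "D" "E" "F"
print(sub)
```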
Finally, there is one more way of indexing matrices (for now), which uses only one index:
[1] "E"
If you give only one index, R counts down the first column, then the second, then the third, and so on, until it reaches the index you specified. Notice how this agrees with the 5th element of the vector V, which was used to make our matrix!
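A short sketch of single-index access, assuming the same 3-by-3 matrix as above. R stores a matrix column by column, so index 5 lands on the second entry of the second column:

```r
# Same matrix as in the examples above (filled column by column)
M <- matrix(c("X", "B", "C", "D", "E", "F", "X", "X", "X"), nrow = 3, ncol = 3)

fifth <- M[5]         # positions 1-3 are column 1, so position 5 is row 2 of column 2
same  <- M[2, 2]      # the same cell, addressed by row and column
flat  <- as.vector(M) # flattening uses the same column-by-column order

print(fifth)  # "E"
```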
And finally, as before, any of these indexing methods can be used to change an element’s value:
[,1] [,2] [,3]
[1,] "X" "D" "X"
[2,] "Hats" "Hats" "Hats"
[3,] "C" "F" "X"
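The assignment behind this output is not shown. A likely candidate, sketched here on the same 3-by-3 matrix, replaces the whole second row at once:

```r
# Same matrix as in the examples above
M <- matrix(c("X", "B", "C", "D", "E", "F", "X", "X", "X"), nrow = 3, ncol = 3)

# Leaving the column index blank selects all of row 2;
# the single value "Hats" is reused for each cell in that row
M[2, ] <- "Hats"
print(M)
```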
5.5.3 Lists
So far we’ve discussed three different ways of accessing elements in a list:
[1] "apple"
[1] "banana"
[1] "cherry"
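The list and the three calls that printed these values are missing from this copy. The sketch below shows the three access styles on a stand-in list; the component names first, second, and third are assumptions made for illustration:

```r
# Hypothetical list; the component names are invented for this example
L <- list(first = "apple", second = "banana", third = "cherry")

by_position <- L[[1]]        # double brackets with a position
by_dollar   <- L$second      # dollar sign with a component name
by_name     <- L[["third"]]  # double brackets with a name in quotes

print(by_position)  # "apple"
print(by_dollar)    # "banana"
print(by_name)      # "cherry"
```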
And these are basically the only options for extracting a single component. Unfortunately, you cannot put a vector of indices inside double brackets to access multiple list components at once:
Error in L[[1:3]]: recursive indexing failed at level 2
What L[[1:3]] actually does, as the error message might suggest, is access elements within a nested list, but that is beyond the scope of this class.
Create a vector containing the numbers 1 through 1000 in order (hint: try using 1:1000 on the right of the assignment operator). Then, change elements 4, 196, and 501 through 556 to “brussel sprouts”. What happened to the other elements in the vector?
5.5.4 Data Frames
Remember that data frames are just lists of vectors, so the same indexing rules for lists and vectors apply. But remember that matrix indexing rules also apply! Here are some examples with the Olympic athletes data.
athletes3 <- athletes[1:20, 1:5] # Get the first 20 rows and first 5 columns, and assign it to athletes3
[1] "Berta Hrub"
[2] "Joaquim Vital"
[3] "Madelon Baans"
[4] "Achille Canna"
[5] "Antje Buschschulte (-Meeuw)"
[6] "Ludwig Gredler"
[7] "Pawe Abratkiewicz"
[8] "Jerzy Zdzisaw Janikowski"
[9] "Giuseppe \"Peppino\" Tanti"
[10] "Carl-Jan Gustaf David Hamilton"
[11] "Bla Nagy"
[12] "Vincent Vittoz"
[13] "Joyce May Racek (-Markley, -Budrunas)"
[14] "Seiichiro Kashio"
[15] "Surzer"
[16] "Dimitrios Kantzias"
[17] "Kim Gwang-Suk"
[18] "Joshua Noel \"Josh\" Laban"
[19] "Alejandro Vidal Arellano"
[20] "Mariusz Latkowski"
Remember that each column of a data frame is just a vector, so we can use list indexing to access the Name column, then immediately use vector indexing to get only the indices that we want:
[1] "Berta Hrub" "Joaquim Vital" "Madelon Baans"
Notice that list indexing (double brackets or $) gives you one component, that is, one data frame column, at a time, whereas matrix indexing can select several columns at once. Since data frames also accept matrix-style indexing, you can select multiple columns at once, as the first example above showed.
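Because the athletes data is not bundled with this page, here is a minimal sketch on a small made-up data frame (the name people and its rows are invented) showing matrix-style and list-style indexing side by side:

```r
# Small stand-in data frame with made-up rows (not the real athletes data)
people <- data.frame(
  ID   = 1:4,
  Name = c("Ana", "Ben", "Chloe", "Dev"),
  Age  = c(23, 31, 27, 19),
  stringsAsFactors = FALSE
)

# Matrix-style: rows 1-3 and columns 1-2, returned as a smaller data frame
first_rows <- people[1:3, 1:2]

# List-style to pull out one column, then vector indexing on the result
first_names <- people$Name[1:2]
same_names  <- people[["Name"]][1:2]

print(first_rows)
print(first_names)  # "Ana" "Ben"
```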
You can also access columns by name like so:
(and if your rows have names, you can access rows by name as well).
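The example for name-based access did not survive in this copy. A sketch on a made-up data frame follows; the row names r1 to r3 and all values are assumptions for illustration:

```r
# Made-up data frame with explicit row names
people <- data.frame(
  Name = c("Ana", "Ben", "Chloe"),
  Age  = c(23, 31, 27),
  row.names = c("r1", "r2", "r3"),
  stringsAsFactors = FALSE
)

name_col <- people[, "Name"]            # one column by name (comes back as a vector)
two_cols <- people[, c("Name", "Age")]  # several columns by name
one_row  <- people["r2", ]              # one row by its row name

print(name_col)  # "Ana" "Ben" "Chloe"
print(one_row)
```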
Using the mtcars data frame (included in R), get the mpg for the cars in rows 15 through 20, and assign it to a vector. Now find the average mpg of those cars.
Vectors can have named elements too. Create a vector with a <- c(1, 2, 3) and set the names with names(a) <- c("angus", "brillow", "chandelier"), then see what happens if you type a[["angus"]]! Matrices can also be accessed using names.
5.5.5 Advanced Indexing
There are even more ways to select the data you need from your R data structures. Let’s look at some more advanced techniques.
5.5.5.1 Logical Based Indexing
One very useful method that R provides is to access elements of a vector using a different, logical vector of the same length. As the following example shows, R returns only the elements at positions where the logical vector is TRUE:
v <- c("alpha", "bravo", "charlie", "delta") # The vector we want to access
i <- c(FALSE, TRUE, FALSE, TRUE) # The logical vector we'll use to index
# Index v using i:
v[i]
[1] "bravo" "delta"
Why is this so useful? Remember that you can create logical vectors by comparing any type of vector to some value:
[1] FALSE FALSE FALSE TRUE
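The comparison that produced this output is not shown. One comparison that would give it, assuming v is the four-word vector defined above, is sketched here together with a numeric example (nums is a made-up vector):

```r
v <- c("alpha", "bravo", "charlie", "delta")

# Comparing a vector to a single value gives one TRUE/FALSE per element
is_delta <- v == "delta"
print(is_delta)  # FALSE FALSE FALSE TRUE

# The logical vector can then be used directly as an index
print(v[is_delta])  # "delta"

# The same idea works with numbers (made-up values)
nums <- c(3, 8, 1, 12)
big  <- nums > 5
print(nums[big])  # 8 12
```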
This means you can create a logical vector in order to extract only the elements of a vector which match some criterion. For example, let’s create a logical vector based on whether an Olympic athlete’s sport was “Tug-Of-War”.
plays_tug_of_war <- athletes$Sport == "Tug-Of-War" # Create logical vector
sum(plays_tug_of_war) # Count how many TRUEs
[1] 5
Now let’s use that logical vector to get the names of the athletes:
[1] "Edgar Lindenau Aabye" "Willie Slade" "William Hirons"
[4] "Ernest Walter Ebbage" "William Penn"
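The call that printed these names is not shown; it was presumably something like athletes$Name[plays_tug_of_war], though that is an assumption. The same pattern on a small made-up data frame (athletes_demo and its rows are invented) looks like this:

```r
# Made-up stand-in for the athletes data
athletes_demo <- data.frame(
  Name  = c("Ana", "Ben", "Chloe", "Dev"),
  Sport = c("Tug-Of-War", "Judo", "Tug-Of-War", "Rowing"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the sport matches
plays <- athletes_demo$Sport == "Tug-Of-War"
n_players <- sum(plays)  # TRUE counts as 1, so this counts the matches

# Use the logical vector to index a different column of the same data frame
tug_names <- athletes_demo$Name[plays]
print(tug_names)  # "Ana" "Chloe"
```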
Using the Olympic athletes data, create a logical vector which is true when an athlete’s sport is wrestling. Then access the age of all wrestlers, and assign the ages to another vector. Finally, compute the average age in the wrestlers vector (remember, there may be duplicate athlete names, so this average won’t mean much; the emphasis is on indexing right now).
Logical vectors can also be used to subset a data frame based on some condition. That is, we take entire rows which meet a condition, rather than just a single variable. For example:
# Subset the athletes data frame to get only Summer athletes.
athletes_summer <- athletes[athletes$Season == "Summer",]
Note the comma (,) after the condition, which indicates that we’re indexing rows, not columns, of the data frame.
You can specify multiple conditions using “and” (&) and “or” (|) like this:
# Create logical vector which is true for female gymnasts (female AND gymnast)
index <- (athletes$Sex == "F") & (athletes$Sport == "Gymnastics")
# Select only female gymnasts
fem_gym <- athletes[index,]
# Create logical vector which is true if sport is basketball OR gymnastics
index <- (athletes$Sport == "Basketball") | (athletes$Sport == "Gymnastics")
# Select athletes whose sport is basketball OR gymnastics.
bb_gy <- athletes[index,]
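To try the same pattern without the athletes file, here is a self-contained sketch on a made-up data frame (demo and its rows are invented for illustration):

```r
# Made-up stand-in for the athletes data
demo <- data.frame(
  Name  = c("Ana", "Ben", "Chloe", "Dev", "Eve"),
  Sex   = c("F", "M", "F", "M", "F"),
  Sport = c("Gymnastics", "Gymnastics", "Basketball", "Rowing", "Rowing"),
  stringsAsFactors = FALSE
)

# AND: both conditions must be TRUE for the same row
both <- (demo$Sex == "F") & (demo$Sport == "Gymnastics")

# OR: at least one condition must be TRUE
either <- (demo$Sport == "Basketball") | (demo$Sport == "Gymnastics")

# The trailing comma keeps every column for the selected rows
fem_gym <- demo[both, ]
bb_gy   <- demo[either, ]

print(fem_gym$Name)  # "Ana"
print(bb_gy$Name)    # "Ana" "Ben" "Chloe"
```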
Season
is “Winter”
Another common use for logical indexing is filling in missing values.
As part of data cleaning, you may decide to change NA’s to some other value. This is easy since we can create a logical vector which is TRUE when a value is NA. We can do this with the is.na function:
# For athletes with no medal, replace `NA` with "No Medal"
athletes$Medal[is.na(athletes$Medal)] <- "No Medal"
Run the above code to replace NA values with “No Medal”, and save the file in your data_clean folder as “athletes_clean.csv”
This is not an endorsement of a particular approach to handling missing values. There are many situation-dependent considerations to weigh when deciding the best thing to do.
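To see the is.na replacement step in isolation, here is a minimal sketch on a made-up character vector (medals is invented and stands in for the Medal column):

```r
# Made-up values standing in for athletes$Medal
medals <- c("Gold", NA, "Bronze", NA)

# is.na() returns TRUE exactly where a value is missing
missing <- is.na(medals)
print(missing)  # FALSE TRUE FALSE TRUE

# Assign through the logical index to fill in only the missing entries
medals[missing] <- "No Medal"
print(medals)   # "Gold" "No Medal" "Bronze" "No Medal"
```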
5.5.5.2 Negative Indexing
Sometimes it’s easier to specify which columns or rows should be excluded from indexing, rather than those that should be included. To select every column except the first one, you can use a negative index:
This also works with numeric vectors:
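The negative-indexing examples are missing from this copy. The sketch below shows the idea on made-up data (demo and v are invented): a negative index drops the named position and keeps everything else.

```r
# Made-up data frame
demo <- data.frame(
  ID   = 1:3,
  Name = c("Ana", "Ben", "Chloe"),
  Age  = c(23, 31, 27),
  stringsAsFactors = FALSE
)

no_first_col  <- demo[, -1]  # every column except the first (drops ID)
no_second_row <- demo[-2, ]  # every row except the second

# Negative indices work on plain vectors too, alone or as a vector of indices
v <- c(10, 20, 30, 40)
drop_one <- v[-1]         # drops the first element
drop_two <- v[-c(1, 2)]   # drops the first two elements

print(names(no_first_col))  # "Name" "Age"
print(drop_two)             # 30 40
```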
5.5.5.3 Nested Indexing: [[1]][3]
In R, it’s likely that at some point you will encounter nested data structures, such as vectors within lists (data frames!) and lists within lists. This might make indexing more confusing at first, but a little practice will help you keep things organized in your mind. Consider the following example:
# Create a vector and a matrix
V <- 1:16
M <- matrix(V, 4, 4)
# Create a list which contains them:
L <- list(V, M)
# Create a character vector
C <- c("I", "think", "therefore", "I", "am")
# Create another list which will contain L and C:
L2 <- list(L, C)
With lists like this, it’s easy to see code like L2[[1]][[2]][2,3] and get confused about what is happening. It’s best to break down the statement from left to right:
L2 # The second list, L2
L2[[1]] # The first component of L2, which is the first list, L
L2[[1]][[2]] # The second element of L, which is the matrix M
L2[[1]][[2]][2,3] # The second row and third column of M.
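The breakdown above can be checked step by step. This sketch rebuilds the same objects and stores each intermediate result (the names step1, step2, step3, and word are only for illustration):

```r
# Rebuild the nested structure from the example
V  <- 1:16
M  <- matrix(V, 4, 4)
L  <- list(V, M)
C  <- c("I", "think", "therefore", "I", "am")
L2 <- list(L, C)

step1 <- L2[[1]]              # the inner list L
step2 <- L2[[1]][[2]]         # the matrix M inside L
step3 <- L2[[1]][[2]][2, 3]   # row 2, column 3 of M

print(step3)  # 10, since column 3 of M holds 9, 10, 11, 12

# The [[1]][3] pattern from the heading: pick a component, then index into it
word <- L2[[2]][3]
print(word)   # "therefore"
```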
We have discussed quite a few ways to index data, but rest assured there are more ways that we did not discuss! We won’t address them now, to keep things simple!