5.5 Indexing

Part of doing interesting things with data is being able to select just the data that you need for a particular circumstance. You’ve already seen how to get a particular element from a vector or matrix, or a specific component from a list, using indices. A datum’s index is its position in the vector or list. For example, to get the second element of a vector A, we use the index 2 in square brackets: A[2]. The process of selecting elements using their indices is called indexing, and R provides multiple ways of indexing vectors. Below we’ll cover some basic indexing and more advanced indexing for the different data structures in R.

Download the progress check 4 template and follow the instructions. That document should include all progress reports from Section 5.5 through (and including) Section 6.1.

5.5.1 Vectors

Let’s define a vector and access an element in the way you already know:

# Create an example vector
V <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")

# Access the 5th element
V[5]
[1] "E"

Unlike many other languages, R indices start at 1, not 0, so the first element is accessed as V[1], the second as V[2], and so on.

Here are some other ways you can index as well. You can access multiple indices at the same time using a numeric vector of indices:

V[c(1, 2, 5)]  # Access elements 1, 2, and 5
[1] "A" "B" "E"

If you need to access several indices in a row, you can use a colon (:):

V[2:7]  # Access elements 2 through 7
[1] "B" "C" "D" "E" "F" "G"

You can even combine these two methods:

V[c(1:3, 6)]  # Access elements 1, 2, 3, and 6
[1] "A" "B" "C" "F"

Note that the following are all equivalent ways to access the first three elements of V:

  • V[1:3]
  • V[c(1,2,3)]
  • V[c(1:3)]
  • V[c(1:2,3)]
  • can you think of another example?

But the first way is probably the clearest for someone else to read. All of these methods work with assignment as well:

V[c(1, 7:9)] <- "X"  # Change elements 1, 7, 8, and 9 to "X"
V
[1] "X" "B" "C" "D" "E" "F" "X" "X" "X"

Even though these examples use a character vector, this indexing works on vectors of any type.
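As a quick illustration of that last point, here is a minimal sketch that repeats the same indexing on a numeric vector. The vector scores and its values are made up for this example.

```r
# Hypothetical numeric vector of test scores (made-up values)
scores <- c(90, 72, 85, 64, 98)

scores[2]         # A single element: 72
scores[c(1, 5)]   # Elements 1 and 5: 90 98
scores[2:4]       # Elements 2 through 4: 72 85 64

# Assignment works the same way as with the character vector V
scores[c(2, 4)] <- 0
scores            # 90 0 85 0 98
```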

5.5.2 Matrices

To access an element of a matrix, we have to specify the row and the column. Let’s create a matrix from the V vector and access one of its elements:

M <- matrix(V, 3, 3)  # Create matrix M with data from vector V
M
     [,1] [,2] [,3]
[1,] "X"  "D"  "X" 
[2,] "B"  "E"  "X" 
[3,] "C"  "F"  "X" 
M[1,2]  # Access the element in row 1, column 2
[1] "D"

Recall that we can access an entire row or column by leaving the other index blank:

M[1,]  # Access the entire first row
[1] "X" "D" "X"
M[,2]  # Access the entire second column
[1] "D" "E" "F"

But any of the indexing we just used for vectors can also be used on matrices:

M[1:2, c(2, 3)]  # Access the elements in rows 1 and 2, columns 2 and 3.
     [,1] [,2]
[1,] "D"  "X" 
[2,] "E"  "X" 

Finally, there is one more way of indexing matrices (for now), which uses only one index:

M[5]  # Access the "5th" element of the matrix
[1] "E"

If you give one index, then R will count down the first column, then the second, then the third, etc., until it reaches the index you specified. This is the same order in which matrix() filled M from V, so M[5] agrees with the 5th element of the vector V, which was used to make our matrix! And finally, as before, any of these indexing methods can be used to change an element’s value:

M[2, 1:3] <- "Hats"
M
     [,1]   [,2]   [,3]  
[1,] "X"    "D"    "X"   
[2,] "Hats" "Hats" "Hats"
[3,] "C"    "F"    "X"   
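A single index counts through a matrix in the order in which matrix() filled it, which is column by column by default. The following sketch checks this on a small numeric matrix. The name N is made up here so that M is left untouched.

```r
# A small numeric matrix, filled column by column (R's default)
N <- matrix(1:6, nrow = 2, ncol = 3)
N
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

N[4]      # 4th element, counting down the columns: 4
N[2, 2]   # The same element, by row and column: 4
```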

5.5.3 Lists

So far we’ve discussed three different ways of accessing elements in a list:

L <- list(A = "apple", b = "banana", c = "cherry")

L[[1]]  # Access using index number
[1] "apple"
L[["b"]]  # Access using component name
[1] "banana"
L$c  # Access using component name and dollar sign notation
[1] "cherry"

And these are basically the only options for getting at a single component. Unfortunately, you cannot give the double brackets a vector of indices in order to access multiple list components at once:

L[[1:3]]  # This does not work
Error in L[[1:3]]: recursive indexing failed at level 2

What L[[1:3]] actually does (as the error message might suggest) is access elements within a nested list, but that is beyond the scope of this class.
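As a side note, single square brackets do accept a vector of indices on a list, but they return a smaller list rather than the components themselves. A minimal sketch, using a list named fruits (a made-up name, so that L is not overwritten):

```r
fruits <- list(a = "apple", b = "banana", c = "cherry")

fruits[1:2]          # A list holding the first two components
class(fruits[1:2])   # "list"

fruits[[1]]          # The first component itself: "apple"
class(fruits[[1]])   # "character"
```

In other words, single brackets keep the list wrapper, while double brackets pull out the contents of one component.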

Create a vector containing the numbers 1 through 1000 in order (hint: try using 1:1000 on the right of the assignment operator). Then, change elements 4, 196, and 501 through 556 to “brussel sprouts”. What happened to the other elements in the vector?

5.5.4 Data Frames

Remember that data frames are just lists of vectors, so the same indexing rules for lists and vectors apply. But remember that matrix indexing rules also apply! Here are some examples with the Olympic athletes data.

athletes3 <- athletes[1:20, 1:5]  # Get the first 20 rows and first 5 columns, and assign it to athletes3
athletes3$Name  # Get the Name column
 [1] "Berta Hrub"                           
 [2] "Joaquim Vital"                        
 [3] "Madelon Baans"                        
 [4] "Achille Canna"                        
 [5] "Antje Buschschulte (-Meeuw)"          
 [6] "Ludwig Gredler"                       
 [7] "Pawe Abratkiewicz"                    
 [8] "Jerzy Zdzisaw Janikowski"             
 [9] "Giuseppe \"Peppino\" Tanti"           
[10] "Carl-Jan Gustaf David Hamilton"       
[11] "Bla Nagy"                             
[12] "Vincent Vittoz"                       
[13] "Joyce May Racek (-Markley, -Budrunas)"
[14] "Seiichiro Kashio"                     
[15] "Surzer"                               
[16] "Dimitrios Kantzias"                   
[17] "Kim Gwang-Suk"                        
[18] "Joshua Noel \"Josh\" Laban"           
[19] "Alejandro Vidal Arellano"             
[20] "Mariusz Latkowski"                    

Remember that each column of a data frame is just a vector, so we can use list indexing to access the Name column, then immediately use vector indexing to get only the indices that we want:

athletes3$Name[1:3]  # Get the first three elements of the Name column
[1] "Berta Hrub"    "Joaquim Vital" "Madelon Baans"

Notice that with list indexing ([[ ]] or $) you can only access one component (which is what a data frame column is) at a time, but with matrices you can access multiple columns at once. Since data frames can use matrix indexing, you can select multiple columns at once, as the first example above showed.

You can also access columns by name like so:

athletes3[,c("Name", "Sex")]  # Access Name and Sex columns

(and if your rows have names, you can access rows by name as well).
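Here is a small sketch of indexing rows by name. The data frame pets, its row names, and its values are all made up for illustration.

```r
# Hypothetical data frame with made-up values and row names
pets <- data.frame(
  species = c("dog", "cat", "fish"),
  age     = c(4, 7, 1),
  row.names = c("Rex", "Tom", "Bubbles"),
  stringsAsFactors = FALSE
)

pets["Tom", ]                      # One row, selected by name
pets[c("Rex", "Bubbles"), "age"]   # Two rows of the age column: 4 1
```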

Using the mtcars data frame (included in R), get the mpg for the cars in rows 15 through 20, and assign it to a vector. Now find the average mpg of those cars.

Think it’s weird that data frames can be indexed like matrices? It gets weirder. When vectors have names, they can be indexed like lists! Try for yourself: create a vector a <- c(1, 2, 3) and set the names with names(a) <- c("angus", "brillow", "chandelier"), then see what happens if you type a[["angus"]]! Matrices can be indexed by name as well.
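For the matrix case, a minimal sketch might look like the following. The matrix P and its row and column names are made up for this example.

```r
# Hypothetical 2 x 2 matrix with row and column names
P <- matrix(1:4, nrow = 2, ncol = 2)
rownames(P) <- c("r1", "r2")
colnames(P) <- c("c1", "c2")

P["r2", "c1"]   # Row "r2", column "c1": 2
P[, "c2"]       # The whole "c2" column, returned as a named vector
```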

5.5.5 Advanced Indexing

There are even more ways to select the data you need from your R data structures. Let’s look at some more advanced techniques.

5.5.5.1 Logical Based Indexing

One very useful method that R provides is to access elements of a vector using a different, logical vector of the same length. As the following example will show, R returns only the elements at positions where the logical vector is TRUE:

v <- c("alpha", "bravo", "charlie", "delta")  # The vector we want to access
i <- c(FALSE, TRUE, FALSE, TRUE)  # The logical vector we'll use to index

# Index v using i:
v[i]
[1] "bravo" "delta"

Why is this so useful? Remember that you can create logical vectors by comparing any type of vector to some value:

v == "delta"
[1] FALSE FALSE FALSE  TRUE

This means you can create a logical vector in order to extract only the elements of a vector which match some criterion. For example, let’s create a logical vector based on whether an Olympic athlete’s sport was “Tug-Of-War”.

plays_tug_of_war <- athletes$Sport == "Tug-Of-War"  # Create logical vector

sum(plays_tug_of_war)  # Count how many TRUEs
[1] 5

Now let’s use that logical vector to get the names of the athletes:

athletes$Name[plays_tug_of_war]
[1] "Edgar Lindenau Aabye" "Willie Slade"         "William Hirons"      
[4] "Ernest Walter Ebbage" "William Penn"        
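The same pattern works with any comparison, not just ==. Here is a minimal sketch on a numeric vector. The vector ages and its values are made up for illustration.

```r
# Hypothetical ages (made-up values)
ages <- c(23, 31, 19, 27, 35)

over_25 <- ages > 25   # FALSE TRUE FALSE TRUE TRUE
sum(over_25)           # Count how many TRUEs: 3
ages[over_25]          # Only the matching elements: 31 27 35
mean(ages[over_25])    # Average of the matching elements: 31
```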

Using the Olympic athletes data, create a logical vector which is true when an athlete’s sport is wrestling. Then access the age of all wrestlers, and assign the ages to another vector. Finally, compute the average age of the wrestlers vector (remember, there may be duplicate athlete names, so this average won’t mean much; the emphasis is on indexing right now).

Logical vectors can also be used to subset a data frame based on some condition. That is, we take entire rows which meet a condition, rather than just a single variable. For example:

# Subset the athletes data frame to get only Summer athletes.
athletes_summer <- athletes[athletes$Season == "Summer",]

In the last example, we are creating the logical vector and immediately using it to index the rows. Pause and think through what’s happening in this example if it’s not quite clear. Also note the placement of the comma (,), which indicates that we’re indexing rows, not columns, of the data frame.
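If it helps, the same idea can be split into two steps on a tiny data frame. This is only a sketch: the data frame df and its contents are made up, and stringsAsFactors = FALSE is set explicitly so the columns stay character vectors in older versions of R.

```r
# Hypothetical data frame with made-up values
df <- data.frame(
  name   = c("Ana", "Ben", "Cy", "Dee"),
  season = c("Summer", "Winter", "Summer", "Winter"),
  stringsAsFactors = FALSE
)

is_summer <- df$season == "Summer"   # TRUE FALSE TRUE FALSE
df[is_summer, ]                      # Rows 1 and 3, all columns
df[is_summer, "name"]                # Rows 1 and 3, name column: "Ana" "Cy"
```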

You can specify multiple conditions using “and” (&) and “or” (|) like this:

# Create logical vector which is true for female gymnasts (female AND gymnast)
index <- (athletes$Sex == "F") & (athletes$Sport == "Gymnastics")

# Select only female gymnasts
fem_gym <- athletes[index,]

# Create logical vector which is true if sport is basketball OR gymnastics
index <- (athletes$Sport == "Basketball") | (athletes$Sport == "Gymnastics")

# Select athletes whose sport is basketball OR gymnastics. 
bb_gy <- athletes[index,]

Create a data frame called athletes_winter with only the rows whose Season is “Winter”.
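To see how & and | combine conditions element by element, here is a small sketch on plain vectors. The names a, b, and x and their values are made up for illustration.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE only where both are TRUE:    TRUE FALSE FALSE FALSE
a | b   # TRUE where at least one is TRUE:  TRUE TRUE TRUE FALSE

x <- c(5, 12, 8, 20)    # Made-up values
x[(x > 6) & (x < 15)]   # Greater than 6 AND less than 15: 12 8
x[(x < 6) | (x > 15)]   # Less than 6 OR greater than 15: 5 20
```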

Another common use for logical indexing is filling in missing values. As part of data cleaning, you may decide to change NAs to some other value. This is easy since we can create a logical vector which is TRUE when a value is NA. We can do this with the is.na function:

# For athletes with no medal, replace `NA` with "No Medal"
athletes$Medal[is.na(athletes$Medal)] <- "No Medal"

Run the above code to replace NA values with “No Medal”, and save the file in your data_clean folder as “athletes_clean.csv”.

This is not an endorsement of a particular approach to handling missing values. There are many situation-dependent considerations that should be made in order to decide the best thing to do.
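To see the mechanics of the is.na replacement in isolation, here is a minimal sketch on a small vector. The vector medals and its values are made up for illustration.

```r
# Hypothetical vector with missing values
medals <- c("Gold", NA, "Bronze", NA)

is.na(medals)                         # FALSE TRUE FALSE TRUE
medals[is.na(medals)] <- "No Medal"   # Assign only at the NA positions
medals                                # "Gold" "No Medal" "Bronze" "No Medal"
```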

5.5.5.2 Negative Indexing

Sometimes it’s easier to specify which columns or rows should be excluded from indexing, rather than those that should be included. To select every column except the first one, you can use a negative index:

athletes[,-1]  # Leave out the ID column

This also works with a vector of indices:

athletes[-c(1:10),]  # Access all but the first 10 rows.
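Negative indices work on plain vectors too. A minimal sketch, using a made-up vector w:

```r
w <- c("alpha", "bravo", "charlie", "delta")

w[-1]         # Everything except element 1: "bravo" "charlie" "delta"
w[-c(2, 3)]   # Everything except elements 2 and 3: "alpha" "delta"
w[-(1:2)]     # Everything except elements 1 and 2: "charlie" "delta"
```

Note the parentheses in -(1:2). Without them, -1:2 is the sequence -1, 0, 1, 2, and R does not allow mixing negative and positive indices.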

5.5.5.3 Nested Indexing: [[1]][3]

In R, it’s likely that at some point you will encounter nested data structures, such as vectors within lists (data frames!) and lists within lists. This might make indexing more confusing at first, but a little practice will help you keep things organized in your mind. Consider the following example:

# Create a vector and a matrix
V <- 1:16
M <- matrix(V, 4, 4)

# Create a list which contains them:
L <- list(V, M)

# Create a character vector
C <- c("I", "think", "therefore", "I", "am")

# Create another list which will contain L and C:
L2 <- list(L, C)

With lists like this, it’s easy to see code like L2[[1]][[2]][2,3] and get confused about what is happening. It’s best to break down the statement from left to right:

L2                 # The second list, L2
L2[[1]]            # The first component of L2, which is the first list, L
L2[[1]][[2]]       # The second element of L, which is the matrix M
L2[[1]][[2]][2,3]  # The second row and third column of M.
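One way to check a nested index is to store each intermediate step in its own variable. The sketch below rebuilds the objects from the example above so that it runs on its own. The names step1 and step2 are made up here.

```r
# Rebuild the nested structure from the example
V <- 1:16
M <- matrix(V, 4, 4)
L <- list(V, M)
C <- c("I", "think", "therefore", "I", "am")
L2 <- list(L, C)

step1 <- L2[[1]]      # The first component of L2: the list L
step2 <- step1[[2]]   # The second component of L: the matrix M
step2[2, 3]           # Row 2, column 3 of M: 10

L2[[1]][[2]][2, 3]    # The same value in one expression: 10
L2[[2]][3]            # Third element of the character vector C: "therefore"
```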

We have discussed quite a few ways to index data, but rest assured there are more ways that we did not discuss! We won’t address them now, to keep things simple!