3 Subsetting

3.1 Selecting multiple elements

Q1: Fix each of the following common data frame subsetting errors:

mtcars[mtcars$cyl = 4, ]
# use `==`              (instead of `=`)

mtcars[-1:4, ]
# use `-(1:4)`          (instead of `-1:4`)

mtcars[mtcars$cyl <= 5]
# add `,` after the condition to select rows (the `,` is missing)

mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
#  or `%in% c(4, 6)`    (instead of `== 4 | 6`)
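
As a quick check, the corrected calls can be run against the built-in mtcars data. This is only a sketch; the result names (four_cyl, drop_first, small_cyl, four_six) are chosen here for illustration and are not part of the exercise.

```r
# Corrected versions of the four calls above
four_cyl   <- mtcars[mtcars$cyl == 4, ]          # comparison needs `==`
drop_first <- mtcars[-(1:4), ]                   # negate the whole sequence 1:4
small_cyl  <- mtcars[mtcars$cyl <= 5, ]          # the comma selects rows
four_six   <- mtcars[mtcars$cyl %in% c(4, 6), ]  # membership test for 4 or 6

nrow(four_cyl)
#> [1] 11
nrow(drop_first)
#> [1] 28
nrow(four_six)
#> [1] 18
```

Note that `mtcars$cyl == 4 | 6` is not a syntax error: `6` is coerced to TRUE, so the condition is TRUE for every row and the whole data frame is returned.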

Q2: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

x <- 1:5
x[NA]
#> [1] NA NA NA NA NA

A: In contrast to NA_real_, NA has logical type, and logical vectors are recycled to the same length as the vector being subset, i.e. x[NA] is recycled to x[c(NA, NA, NA, NA, NA)]. A missing logical index always yields a missing value, so the result contains five NAs.
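
The difference can be made visible by comparing the types of the two indices and the lengths of the results. A minimal sketch:

```r
x <- 1:5

typeof(NA)
#> [1] "logical"
typeof(NA_real_)
#> [1] "double"

# A numeric index is not recycled: one missing index gives one NA
x[NA_real_]
#> [1] NA

# A logical index is recycled to length(x)
identical(x[NA], x[c(NA, NA, NA, NA, NA)])
#> [1] TRUE
```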

Q3: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]

A: upper.tri(x) returns a logical matrix, which contains TRUE values above the diagonal and FALSE values everywhere else. In upper.tri() the positions for TRUE and FALSE values are determined by comparing x’s row and column indices via .row(dim(x)) < .col(dim(x)).

x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    5
#> [2,]    2    4    6    8   10
#> [3,]    3    6    9   12   15
#> [4,]    4    8   12   16   20
#> [5,]    5   10   15   20   25
upper.tri(x)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,] FALSE  TRUE  TRUE  TRUE  TRUE
#> [2,] FALSE FALSE  TRUE  TRUE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE  TRUE
#> [4,] FALSE FALSE FALSE FALSE  TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE

When subsetting with logical matrices, all elements that correspond to TRUE will be selected. Matrices extend vectors with a dimension attribute, so the vector forms of subsetting can be used (including logical subsetting), and no additional subsetting rules are needed. We should take care that the dimensions of the subsetting matrix match the object of interest; otherwise, unintended selections due to vector recycling may occur. Please also note that this form of subsetting returns a vector instead of a matrix, as the subsetting alters the dimensions of the object.

x[upper.tri(x)]
#>  [1]  2  3  6  4  8 12  5 10 15 20
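
To illustrate that no new rule is involved, the following sketch strips the dimension attribute from both the matrix and the logical index and compares the results. It also shows the silent recycling mentioned above; the name idx is introduced here for illustration only.

```r
x <- outer(1:5, 1:5, FUN = "*")
idx <- upper.tri(x)

# The logical matrix behaves like a plain logical vector of length 25
identical(x[idx], as.vector(x)[as.vector(idx)])
#> [1] TRUE

# A logical index that is too short is recycled without a warning:
# every second element (in column-major order) is selected
length(x[c(TRUE, FALSE)])
#> [1] 13
```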

Q4: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

A: When subsetting a data frame with a single vector, it behaves the same way as subsetting a list of columns. So, mtcars[1:20] would return a data frame containing the first 20 columns of the dataset. However, as mtcars has only 11 columns, the index is out of bounds and an error is thrown. mtcars[1:20, ] is subset with two vectors, so 2d subsetting kicks in, and the first index refers to rows.
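
The two behaviours can be compared directly. In this sketch the error is captured with tryCatch() so that the script keeps running; col_err is a name introduced here for illustration.

```r
# List-style subsetting: 20 columns are requested, but only 11 exist
col_err <- tryCatch(
  mtcars[1:20],
  error = function(e) conditionMessage(e)
)
is.character(col_err)  # TRUE, because an error message was captured
#> [1] TRUE

# Staying within bounds works and keeps all rows
dim(mtcars[1:11])
#> [1] 32 11

# Matrix-style subsetting: the first index selects rows 1 to 20
dim(mtcars[1:20, ])
#> [1] 20 11
```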

Q5: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

A: The elements on the diagonal of a matrix have identical row and column indices. This characteristic can be used to build a two-column numeric matrix of (row, column) pairs, which is then used for subsetting.

diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  idx <- cbind(seq_len(n), seq_len(n))

  x[idx]
}

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25
diag2(x)
#> [1]  1  7 13 19 25

Q6: What does df[is.na(df)] <- 0 do? How does it work?

A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix that encodes the position of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
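
A small example makes the two steps visible. The data frame below is made up for illustration and is not taken from the exercise.

```r
# Hypothetical data frame with one missing value per column
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Step 1: is.na() returns a logical matrix with the same shape as df
na_pos <- is.na(df)
is.matrix(na_pos)
#> [1] TRUE

# Step 2: subassignment replaces only the cells flagged as TRUE
df[is.na(df)] <- 0
df
#>   a b
#> 1 1 0
#> 2 0 5
#> 3 3 6
```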

3.2 Selecting a single element

Q1: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

A: Base R already provides an abundance of possibilities:

# Select column first
mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4

# Select row first
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4

# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4

Q2: Given a linear model, e.g. mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).

A: mod is of type list, which opens up several possibilities. We use $ or [[ to extract a single element:

mod <- lm(mpg ~ wt, data = mtcars)

mod$df.residual
#> [1] 30
mod[["df.residual"]]
#> [1] 30

The same also applies to summary(mod), so we could use, e.g.:

summary(mod)$r.squared
#> [1] 0.753
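
When the name of the required element is not known in advance, names() lists the available components. The sketch below also uses the df.residual() accessor from the stats package as an alternative to direct extraction; mod_summary is a name introduced here for illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)
mod_summary <- summary(mod)

# Inspect which elements can be extracted
names(mod_summary)

# `[[` works on the summary object just like `$`
identical(mod_summary[["r.squared"]], mod_summary$r.squared)
#> [1] TRUE

# Accessor function for the residual degrees of freedom
df.residual(mod)
#> [1] 30
```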

(Tip: The {broom} package provides a very useful approach to work with models in a tidy way.)

3.3 Applications

Q1: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

A: This can be achieved by combining [ and sample():

# Permute columns
mtcars[sample(ncol(mtcars))]

# Permute columns and rows in one step
mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
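
To confirm that a permutation only reorders the data and neither drops nor duplicates anything, the names before and after can be compared. A sketch with an arbitrary seed for reproducibility; perm_cols and perm_both are names introduced here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

perm_cols <- mtcars[sample(ncol(mtcars))]
perm_both <- mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]

# Same columns and rows as before, only in a different order
setequal(names(perm_cols), names(mtcars))
#> [1] TRUE
setequal(rownames(perm_both), rownames(mtcars))
#> [1] TRUE
dim(perm_both)
#> [1] 32 11
```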

Q2: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e. with an initial row, a final row, and every row in between)?

A: Selecting m random rows from a data frame can be achieved through subsetting.

m <- 10
mtcars[sample(nrow(mtcars), m), ]

Sampling a contiguous block of m rows only requires some care in choosing the start index: it must be drawn so that the end index does not exceed the number of rows.

start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1
mtcars[start:end, , drop = FALSE]
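
The same logic could be wrapped in a small helper. The function sample_block() below is a hypothetical name introduced for this sketch; it adds an input check so that m cannot exceed the number of rows.

```r
# Hypothetical helper: draw m contiguous rows from a data frame
sample_block <- function(df, m) {
  stopifnot(m >= 1, m <= nrow(df))

  # The last valid start position is nrow(df) - m + 1
  start <- sample(nrow(df) - m + 1, 1)
  end <- start + m - 1

  df[start:end, , drop = FALSE]
}

block <- sample_block(mtcars, 10)

# The selected rows are consecutive positions in mtcars
pos <- match(rownames(block), rownames(mtcars))
all(diff(pos) == 1)
#> [1] TRUE
```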

Q3: How could you put the columns in a data frame in alphabetical order?

A: We combine [ with order() or sort():

mtcars[order(names(mtcars))]
mtcars[sort(names(mtcars))]
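
Both variants should give the same column order: order() returns integer positions, while sort() returns the sorted names themselves, and [ accepts either. A brief check, with by_order and by_sort as names introduced here:

```r
by_order <- mtcars[order(names(mtcars))]  # subset by integer positions
by_sort  <- mtcars[sort(names(mtcars))]   # subset by column names

identical(names(by_order), names(by_sort))
#> [1] TRUE

# The column names are now in sorted order
is.unsorted(names(by_order))
#> [1] FALSE
```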