# 4 Subsetting

## 4.1 Selecting multiple elements

1. Q: Fix each of the following common data frame subsetting errors:

mtcars[mtcars$cyl = 4, ]
# use `==`              (instead of `=`)

mtcars[-1:4, ]
# use `-(1:4)`          (instead of `-1:4`)

mtcars[mtcars$cyl <= 5]
# `,` is missing

mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
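As a quick check, the corrected calls can be run side by side. This is only a sketch: the variable names on the left are made up for illustration, and `%in%` is mentioned as one possible alternative, not as the expected answer.

```r
# Corrected versions of the four calls; each now returns a data frame
four_cyl    <- mtcars[mtcars$cyl == 4, ]                     # comparison, not assignment
drop_first  <- mtcars[-(1:4), ]                              # negate the whole sequence
small_cyl   <- mtcars[mtcars$cyl <= 5, ]                     # the comma selects rows
four_or_six <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]   # or: mtcars$cyl %in% c(4, 6)

nrow(four_cyl)
nrow(drop_first)
nrow(small_cyl)
nrow(four_or_six)
```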
2. Q: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

x <- 1:5
x[NA]
#> [1] NA NA NA NA NA

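A: NA has logical type, and a logical index is recycled to the length of the vector being subsetted, so x[NA] is internally treated as x[c(NA, NA, NA, NA, NA)]. Subsetting an atomic vector with NA yields NA, and this happens five times, which is why five missing values are returned. In contrast, NA_real_ is a double, so it acts as a single (missing) positional index that is not recycled, and x[NA_real_] returns just one NA.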
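The difference between the logical NA and NA_real_ can be made visible with a few base R calls. A minimal sketch:

```r
x <- 1:5

typeof(NA)        # logical index: recycled to length(x)
typeof(NA_real_)  # double index: treated as one (missing) position

length(x[NA])
length(x[NA_real_])

# The recycled form written out explicitly
identical(x[NA], x[c(NA, NA, NA, NA, NA)])
```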

3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]

A: upper.tri() returns a logical matrix containing TRUE for all elements above the diagonal and FALSE otherwise. The implementation of upper.tri() is straightforward, but quite interesting, as it compares row and column indices, .row(dim(x)) < .col(dim(x)) in current R versions (or <= when diag = TRUE), to create the logical matrix. Subsetting with it behaves exactly like subsetting with any other logical matrix: all elements that correspond to TRUE are selected and returned as a vector in column-major order. We don't need to treat this form of subsetting in a special way.
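A smaller 3 x 3 example makes the logical matrix and the column-major order of the result easier to see. The sketch below uses row() and col() to rebuild the same mask by hand; that is an illustration of the idea, not a copy of the internal implementation.

```r
x <- outer(1:3, 1:3, FUN = "*")

idx <- upper.tri(x)
idx                      # logical matrix, TRUE strictly above the diagonal

# Same mask built from row and column indices
all(idx == (row(x) < col(x)))

# Selected elements come back column by column: (1,2), (1,3), (2,3)
x[idx]

# Including the diagonal
x[upper.tri(x, diag = TRUE)]
```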

4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

A: When a data frame is subsetted with a single vector, it behaves like a list, so mtcars[1:20] would in general return a data frame of the first 20 columns of the dataset. But mtcars has only 11 columns, so the index is out of bounds and an error is thrown. mtcars[1:20, ] is subsetted with two vectors (the second one is empty and selects everything), so the first 20 rows of all columns are returned.
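Both behaviours can be checked without aborting a script by wrapping the failing call in tryCatch(). A minimal sketch, where `res` is just an illustrative name:

```r
# Column subsetting: 20 columns requested, only 11 exist
res <- tryCatch(mtcars[1:20], error = function(e) conditionMessage(e))
res                   # the error message as a string

# Row subsetting: mtcars has 32 rows, so 1:20 is valid
dim(mtcars[1:20, ])

# Column subsetting within bounds works as expected
dim(mtcars[1:11])
```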

5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

A: The elements in the diagonal of a matrix have the same row- and column indices. This characteristic can be used to create a suitable numeric matrix used for subsetting.

diag2 <- function(x) {
  n <- min(dim(x))
  indices <- seq_len(n)
  diag_matrix <- matrix(rep(indices, 2), ncol = 2)

  x[diag_matrix]
}

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25
diag2(x)
#> [1]  1  7 13 19 25
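Because the function takes min(dim(x)), it should also handle matrices with more rows than columns. The following sketch restates the function with cbind() (an equivalent way to build the two-column index matrix, chosen here for brevity) and compares it with diag() on a tall and a wide matrix without dimnames.

```r
# Variant of diag2() using cbind() to build the (row, column) index matrix
diag2 <- function(x) {
  idx <- seq_len(min(dim(x)))
  x[cbind(idx, idx)]
}

tall <- matrix(1:12, nrow = 4)  # 4 x 3
wide <- t(tall)                 # 3 x 4

identical(diag2(tall), diag(tall))
identical(diag2(wide), diag(wide))
```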
6. Q: What does df[is.na(df)] <- 0 do? How does it work?

A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix and encodes the position of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
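A small data frame shows the two steps separately. This is a sketch with made-up data; `df` and its columns are purely illustrative.

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Step 1: logical matrix marking the missing positions
is.na(df)

# Step 2: subassignment with that matrix replaces only the TRUE positions
df[is.na(df)] <- 0
df
```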

## 4.2 Selecting a single element

1. Q: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

A: Base R already provides an abundance of possibilities:

# using [[3]] instead of [3] would also work in these examples
mtcars$cyl[3]
#> [1] 4
mtcars[ , "cyl"][3]
#> [1] 4
mtcars[["cyl"]][3]
#> [1] 4
with(mtcars, cyl[3])
#> [1] 4

mtcars[3, 2]
#> [1] 4
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4
with(mtcars[3, ], cyl)
#> [1] 4

tail(head(mtcars, 3), 1)$cyl
#> [1] 4
head(tail(mtcars, 30), 1)$cyl  # not very practical ;)
#> [1] 4

subset(mtcars, rownames(mtcars) == "Datsun 710")$cyl
#> [1] 4

When we turn to other libraries, e.g. the tidyverse packages, even more possibilities open up. As an example:

library(magrittr)

mtcars %>% purrr::pluck("cyl", 3)
#> [1] 4
mtcars %>% dplyr::pull(cyl) %>% purrr::pluck(3)
#> [1] 4
2. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).

A: mod has the type list, which opens up several possibilities:

mod <- lm(mpg ~ wt, data = mtcars)

mod$df.residual       # output preserved
#> [1] 30
mod$df.res # $ allows partial matching
#> [1] 30
mod["df.residual"]    # list output
#> $df.residual
#> [1] 30
mod[["df.residual"]]  # output preserved
#> [1] 30

The same also applies to summary(mod), so we could use, e.g.:

summary(mod)$r.squared

(Tip: The broom package provides a very useful approach for working with models in a tidy way.)
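Besides list-style subsetting, base R also offers an accessor function for the residual degrees of freedom, and the summary object stores them in its df component. A minimal sketch using only the stats package:

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Accessor function instead of reaching into the list
df.residual(mod)

# The summary object is a list too; its df component holds
# (number of coefficients, residual df, ...)
s <- summary(mod)
s$df[2]

# [[ avoids the partial matching that $ allows
s[["r.squared"]]
```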

## 4.3 Applications

1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?

A: This can be achieved by combining [ and sample():

# Permute columns
iris[sample(ncol(iris))]

# Permute columns and rows in one step
iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
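To convince ourselves that the permutation only reorders the data, we can undo it again. The sketch below assumes the default row names of iris ("1" to "150"); the seed is arbitrary and only there for reproducibility.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

perm <- iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]

dim(perm)                           # same shape as iris
setequal(names(perm), names(iris))  # same columns, possibly in another order

# Undo the permutation: sort rows by their original row names,
# then put the columns back in their original order
restored <- perm[order(as.integer(rownames(perm))), names(iris)]
all(mapply(identical, restored, iris))
```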
2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

A: Selecting m random rows from a data frame can be achieved through subsetting.

m <- 10
iris[sample(nrow(iris), m), ]

Keeping subsequent rows together as a “blocked sample” requires only some caution to get the start- and end-index correct.

start <- sample(nrow(iris) - m + 1, 1)
end <- start + m - 1
iris[start:end, , drop = FALSE]
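The blocked sample can be wrapped in a small helper. Note that `sample_block()` is a hypothetical name introduced here for illustration; the input check is an added assumption about how invalid values of m should be handled.

```r
# Hypothetical helper: draw m contiguous rows from a data frame
sample_block <- function(df, m) {
  stopifnot(m >= 1, m <= nrow(df))
  # Latest possible start is nrow(df) - m + 1, so the block always fits
  start <- sample(nrow(df) - m + 1, 1)
  df[seq(start, length.out = m), , drop = FALSE]
}

blk <- sample_block(iris, 10)

# Row names of a contiguous block differ by exactly 1
diff(as.integer(rownames(blk)))
```

With m equal to nrow(df), the only valid start is 1 and the whole data frame is returned.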
3. Q: How could you put the columns in a data frame in alphabetical order?

A: We first sort the column names alphabetically and use this vector to subset the data frame:

iris[sort(names(iris))]
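The same result can be obtained with order(), which returns integer positions instead of the sorted names. A short sketch comparing the two (the variable names are illustrative):

```r
by_name  <- iris[sort(names(iris))]   # character subsetting
by_index <- iris[order(names(iris))]  # integer subsetting

names(by_name)
identical(names(by_name), names(by_index))
```

Keep in mind that sort() and order() on character vectors follow the collation rules of the current locale, so names with mixed case may be ordered differently on different systems.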