# 4 Subsetting

## 4.1 Selecting multiple elements

1. Q: Fix each of the following common data frame subsetting errors:

mtcars[mtcars$cyl = 4, ]       # use == (instead of =)
mtcars[-1:4, ]                 # use -(1:4) (instead of -1:4)
mtcars[mtcars$cyl <= 5]        # , is missing
mtcars[mtcars$cyl == 4 | 6, ]  # use mtcars$cyl == 6 (instead of 6)
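The last fix is the least obvious one, because the faulty call runs without an error. A minimal sketch of why it misbehaves, followed by corrected versions of all four calls (the variable names are illustrative only):

```r
# `mtcars$cyl == 4 | 6` parses as `(mtcars$cyl == 4) | 6`. The non-zero
# number 6 is coerced to TRUE, so the condition holds for every row.
all(mtcars$cyl == 4 | 6)

# Corrected versions of the four calls
four_cyl    <- mtcars[mtcars$cyl == 4, ]
without_1_4 <- mtcars[-(1:4), ]
low_cyl     <- mtcars[mtcars$cyl <= 5, ]
four_or_six <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# `%in%` is a more compact way to express the last condition
identical(four_or_six, mtcars[mtcars$cyl %in% c(4, 6), ])
```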
2. Q: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

x <- 1:5
x[NA]
#> [1] NA NA NA NA NA

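A: In contrast to NA_real_, NA has logical type. Logical index vectors are recycled to the length of the vector being subset, so x[NA] is equivalent to x[c(NA, NA, NA, NA, NA)], and each missing index yields a missing value. x[NA_real_] is a numeric index of length one and therefore returns a single NA.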
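The difference between the two kinds of missing index can be checked directly. This is a small sketch using only base R:

```r
x <- 1:5

typeof(NA)        # "logical"
typeof(NA_real_)  # "double"

# A logical NA is recycled to length(x): one NA per element of x
length(x[NA])

# A numeric or integer NA is a single position index: exactly one NA
length(x[NA_real_])
length(x[NA_integer_])

# The recycled form written out explicitly
identical(x[NA], x[c(NA, NA, NA, NA, NA)])
```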

3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]

A: upper.tri() returns a logical matrix with the same dimensions as x, containing TRUE for all elements above the diagonal and FALSE otherwise (the diagonal itself is included only with diag = TRUE). The implementation of upper.tri() is straightforward, but quite interesting, as it compares row and column indices to create the logical matrix: .row(dim(x)) < .col(dim(x)) by default and .row(dim(x)) <= .col(dim(x)) when diag = TRUE. Subsetting with it behaves like any other subsetting with a logical matrix: all elements that correspond to TRUE are selected and returned as a vector in column-major order. We don’t need to treat this form of subsetting in a special way.
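The following sketch illustrates the behaviour described above. It uses the exported row() and col() functions as a stand-in for the internal helpers, which is an assumption about equivalence rather than a copy of the actual implementation:

```r
x <- outer(1:5, 1:5, FUN = "*")

mask <- upper.tri(x)
is.logical(mask)
dim(mask)

# Default: strictly above the diagonal; diag = TRUE adds the diagonal
identical(mask, row(x) < col(x))
identical(upper.tri(x, diag = TRUE), row(x) <= col(x))

# Subsetting with a logical matrix returns a plain vector,
# filled column by column
x[mask]
identical(x[mask], x[which(mask)])
```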

4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

A: When a data frame is subset with a single vector, it behaves like a list of its columns, so mtcars[1:20] would return a data frame containing the first 20 columns of the dataset. But mtcars has only 11 columns, so the index is out of bounds and an error is thrown. mtcars[1:20, ] is subset with two indices, so 2d subsetting kicks in, and the first index refers to rows.
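A short sketch contrasting the two forms. The error is captured with tryCatch() so that the example runs to completion; the "error" placeholder string is illustrative only:

```r
# mtcars has fewer than 20 columns
ncol(mtcars)  # 11

# List-style subsetting with an out-of-bounds column index fails
res <- tryCatch(mtcars[1:20], error = function(e) "error")
res

# Within bounds, a single index selects columns ...
dim(mtcars[1:5])     # 32 rows, 5 columns
# ... while the first of two indices selects rows
dim(mtcars[1:20, ])  # 20 rows, 11 columns
```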

5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

A: The elements on the diagonal of a matrix have identical row and column indices. This characteristic can be used to create a two-column numeric matrix of (row, column) pairs, which is then used for subsetting.

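diag2 <- function(x) {
  # The diagonal is only as long as the shorter dimension
  n <- min(nrow(x), ncol(x))
  # Each row of idx is one (row, column) pair on the diagonal
  idx <- cbind(seq_len(n), seq_len(n))

  x[idx]
}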

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25
diag2(x)
#> [1]  1  7 13 19 25

6. Q: What does df[is.na(df)] <- 0 do? How does it work?

A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix that encodes the position of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
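A minimal sketch of the two steps on a small, hypothetical data frame (the data are made up for illustration):

```r
# Hypothetical example data, not taken from the text
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Step 1: is.na() returns a logical matrix with the shape of df
mask <- is.na(df)
mask

# Step 2: subassignment replaces only the positions where mask is TRUE
df[mask] <- 0
df
```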

## 4.2 Selecting a single element

1. Q: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

A: Base R already provides an abundance of possibilities:

# Select column first
mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4

# Select row first
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4

# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4

2. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).

A: mod has the type list, which opens up several possibilities:

mod <- lm(mpg ~ wt, data = mtcars)

mod$df.residual       # output preserved
#> [1] 30
mod$df.res            # $ allows partial matching
#> [1] 30
mod["df.residual"]    # list output
#> $df.residual
#> [1] 30
mod[["df.residual"]]  # output preserved
#> [1] 30

The same also applies to summary(mod), so we could use, e.g.:

summary(mod)$r.squared

(Tip: The broom package provides a very useful approach to working with models in a tidy way.)
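As a sanity check, the extracted values can be compared against their textbook definitions. This is a sketch that assumes an ordinary least-squares model with an intercept, which holds for mod as defined in the question:

```r
mod <- lm(mpg ~ wt, data = mtcars)
s <- summary(mod)

# `$` and `[[` extract the same element from the summary list
r2 <- s$r.squared
identical(r2, s[["r.squared"]])

# R squared equals 1 - RSS / TSS for a model with an intercept
rss <- sum(residuals(mod)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(r2, 1 - rss / tss)

# Residual degrees of freedom: observations minus estimated coefficients
mod$df.residual == nrow(mtcars) - length(coef(mod))
```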

## 4.3 Applications

1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?

A: This can be achieved by combining [ and sample():

# Permute columns
iris[sample(ncol(iris))]

# Permute columns and rows in one step
iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
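
A quick way to convince yourself that the one-step version only reorders the data is to check that all rows and columns are still present. The seed below is arbitrary and is set only to make the sketch reproducible:

```r
set.seed(1)  # arbitrary seed, for reproducibility only
permuted <- iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]

# Same shape as the original
dim(permuted)

# Same columns, possibly in a different order
setequal(names(permuted), names(iris))

# Same rows: the original row names 1, ..., 150 each appear exactly once
identical(sort(as.integer(rownames(permuted))), seq_len(nrow(iris)))
```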
2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

A: Selecting m random rows from a data frame can be achieved through subsetting.

m <- 10
iris[sample(nrow(iris), m), , drop = FALSE]

Keeping subsequent rows together as a “blocked sample” requires only some caution to get the start- and end-index correct.

start <- sample(nrow(iris) - m + 1, 1)
end <- start + m - 1
iris[start:end, , drop = FALSE]
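
The index arithmetic can be wrapped in a small helper. sample_block() is a hypothetical name introduced here for illustration; it is a sketch of the approach above, not an established function:

```r
# Hypothetical helper (not from the text): draw m contiguous rows from df
sample_block <- function(df, m) {
  stopifnot(m >= 1, m <= nrow(df))
  # Valid start positions are 1, ..., nrow(df) - m + 1
  start <- sample(nrow(df) - m + 1, 1)
  df[start:(start + m - 1), , drop = FALSE]
}

set.seed(42)  # arbitrary seed, for reproducibility only
block <- sample_block(iris, 10)
idx <- as.integer(rownames(block))

nrow(block)          # 10
all(diff(idx) == 1)  # TRUE: the rows are contiguous
```

Restricting the start position to nrow(df) - m + 1 guarantees that the block never runs past the last row, including the edge case m == nrow(df), where the only valid start is 1.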
3. Q: How could you put the columns in a data frame in alphabetical order?

A: We combine order() with [:

iris[order(names(iris))]