# 4 Subsetting

## 4.1 Selecting multiple elements

**Q**: Fix each of the following common data frame subsetting errors:

```r
mtcars[mtcars$cyl = 4, ]
# use `==` (instead of `=`)

mtcars[-1:4, ]
# use `-(1:4)` (instead of `-1:4`)

mtcars[mtcars$cyl <= 5]
# `,` is missing

mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
```
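For reference, the four corrections can be written out and run as in the following minimal sketch. The variable names are only illustrative, and the `%in%` variant at the end is an alternative spelling that is not part of the original exercise.

```r
# Corrected versions of the four calls from the exercise
four_cyl    <- mtcars[mtcars$cyl == 4, ]   # `==` compares, `=` would be an argument assignment
drop_first  <- mtcars[-(1:4), ]            # drops rows 1 to 4
up_to_five  <- mtcars[mtcars$cyl <= 5, ]   # the trailing `,` makes this a row selection
four_or_six <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# Alternative (not from the exercise): `%in%` avoids repeating the column
four_or_six_alt <- mtcars[mtcars$cyl %in% c(4, 6), ]
identical(four_or_six, four_or_six_alt)
#> [1] TRUE
```

The `-1:4` case fails because `-1:4` is parsed as `(-1):4`, which mixes negative and positive indices.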

**Q**: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)

```r
x <- 1:5
x[NA]
#> [1] NA NA NA NA NA
```

**A**: `NA` has logical type, and logical vectors are recycled to the same length as the vector being subset, i.e. `x[NA]` is recycled to `x[c(NA, NA, NA, NA, NA)]`.
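The difference between the two kinds of `NA` can be checked directly. The following short sketch compares the types of the indices and the lengths of the results.

```r
x <- 1:5

typeof(NA)        # the default NA is logical
#> [1] "logical"
typeof(NA_real_)  # a double NA is treated as a single (missing) position
#> [1] "double"

length(x[NA])        # logical index, recycled to length(x)
#> [1] 5
length(x[NA_real_])  # numeric index of length one
#> [1] 1
```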

**Q**: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

```r
x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]
```

**A**: `upper.tri()` returns a logical matrix containing `TRUE` for all elements above the diagonal and `FALSE` otherwise. The implementation of `upper.tri()` is straightforward, but quite interesting, as it compares row and column indices (e.g. `.row(dim(x)) <= .col(dim(x))` when `diag = TRUE`) to create the logical matrix. Subsetting with it behaves exactly like subsetting with any other logical matrix: all elements that correspond to `TRUE` are selected. We don’t need to treat this form of subsetting in a special way.

**Q**: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?

**A**: When subsetting a data frame with a single vector, it behaves the same way as subsetting a list of the columns, so `mtcars[1:20]` would return a data frame of the first 20 columns of the dataset. But `mtcars` has only 11 columns, so the index is out of bounds and an error is thrown. `mtcars[1:20, ]` is subset with two vectors, so 2d subsetting kicks in, and the first index refers to rows.

**Q**: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).

**A**: The elements in the diagonal of a matrix have the same row and column indices. This characteristic can be used to create a suitable numeric matrix used for subsetting.

```r
diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  idx <- cbind(seq_len(n), seq_len(n))

  x[idx]
}

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25

diag2(x)
#> [1]  1  7 13 19 25
```
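As a further sanity check of the matrix-subsetting approach, the sketch below repeats the definition of `diag2()` so that it runs on its own, and compares it with `diag()` for a wide, a tall, and a square matrix. The test matrices are arbitrary choices for illustration.

```r
diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  # Each row of idx is one (row, column) pair on the diagonal
  idx <- cbind(seq_len(n), seq_len(n))
  x[idx]
}

wide   <- matrix(1:30, nrow = 5)  # 5 x 6
tall   <- matrix(1:30, nrow = 6)  # 6 x 5
square <- matrix(1:16, nrow = 4)  # 4 x 4

identical(diag2(wide), diag(wide))
#> [1] TRUE
identical(diag2(tall), diag(tall))
#> [1] TRUE
identical(diag2(square), diag(square))
#> [1] TRUE
```

Using `min(nrow(x), ncol(x))` is what keeps the index matrix within bounds for non-square input.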

**Q**: What does `df[is.na(df)] <- 0` do? How does it work?

**A**: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
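A small worked example may make the two steps more concrete. The data frame below is hypothetical and contains only numeric columns, so the replacement value `0` does not change any column type.

```r
# Hypothetical example data with one missing value per column
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Step 1: a logical matrix with the same dimensions as df
na_pos <- is.na(df)
dim(na_pos)
#> [1] 3 2

# Step 2: subassignment through that logical matrix
df[na_pos] <- 0
df
#>   a b
#> 1 1 0
#> 2 0 5
#> 3 3 6
```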

## 4.2 Selecting a single element

**Q**: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.

**A**: Base R already provides an abundance of possibilities:

```r
# Select column first
mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4

# Select row first
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4

# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4
```
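The last variant, `mtcars[[c(2, 3)]]`, is perhaps the least obvious one: `[[` with a vector index is applied recursively. The following sketch spells this out and checks that a few of the variants agree. The `candidates` list is only an illustrative helper.

```r
# Column 2 of mtcars is cyl
names(mtcars)[2]
#> [1] "cyl"

# `[[c(2, 3)]]` means: take element 2, then element 3 of the result
identical(mtcars[[c(2, 3)]], mtcars[[2]][[3]])
#> [1] TRUE

# A few of the approaches, collected and compared against the value 4
candidates <- list(
  mtcars$cyl[[3]],
  mtcars[3, "cyl"],
  mtcars[3, ]$cyl,
  mtcars[[c(2, 3)]]
)
all(vapply(candidates, identical, logical(1), y = 4))
#> [1] TRUE
```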

**Q**: Given a linear model, e.g., `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).

**A**: `mod` has the type list, which opens up several possibilities:

```r
mod <- lm(mpg ~ wt, data = mtcars)

mod$df.residual       # output preserved
#> [1] 30
mod$df.res            # `$` allows partial matching
#> [1] 30
mod["df.residual"]    # list output
#> $df.residual
#> [1] 30
mod[["df.residual"]]  # output preserved
#> [1] 30
```

The same also applies to `summary(mod)`, so we could use, e.g.:

```r
summary(mod)$r.squared
```

(Tip: The `broom` package provides a very useful approach to work with models in a tidy way.)
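To illustrate how the extraction operators differ on these list-like objects, here is a short sketch. It shows that `[[` matches names exactly unless `exact = FALSE` is given, and it cross-checks the extracted R squared against the squared correlation, which coincides with it for a simple linear regression with an intercept.

```r
mod <- lm(mpg ~ wt, data = mtcars)

# `[[` matches names exactly by default, unlike `$`
is.null(mod[["df.res"]])
#> [1] TRUE
mod[["df.res", exact = FALSE]]
#> [1] 30

# The summary object is a list as well
r2 <- summary(mod)[["r.squared"]]
all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2)
#> [1] TRUE
```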

## 4.3 Applications

**Q**: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

**A**: This can be achieved by combining `[` and `sample()`:

```r
# Permute columns
iris[sample(ncol(iris))]

# Permute columns and rows in one step
iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
```
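The following sketch checks that such a permutation only reorders the data, and shows why `drop = FALSE` is worth keeping. The seed and the single-column data frame `one_col` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

permuted <- iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]

# Dimensions are unchanged and no row or column is lost
identical(dim(permuted), dim(iris))
#> [1] TRUE
setequal(names(permuted), names(iris))
#> [1] TRUE
setequal(rownames(permuted), rownames(iris))
#> [1] TRUE

# With a single column, omitting `drop = FALSE` returns a plain vector
one_col <- iris["Species"]
is.data.frame(one_col[sample(nrow(one_col)), ])
#> [1] FALSE
is.data.frame(one_col[sample(nrow(one_col)), , drop = FALSE])
#> [1] TRUE
```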

**Q**: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

**A**: Selecting `m` random rows from a data frame can be achieved through subsetting.

```r
m <- 10
iris[sample(nrow(iris), m), , drop = FALSE]
```

Keeping consecutive rows together as a “blocked sample” requires only some care to get the start and end indices right.

```r
start <- sample(nrow(iris) - m + 1, 1)
end <- start + m - 1
iris[start:end, , drop = FALSE]
```
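The same logic could be wrapped in a small helper. This is a sketch under stated assumptions: `sample_block()` is a hypothetical name, and the input check and the seed are illustrative additions.

```r
# Hypothetical helper: draw a contiguous block of m rows from df
sample_block <- function(df, m) {
  stopifnot(m >= 1, m <= nrow(df))
  # The last valid start leaves exactly m rows until the end of df
  start <- sample(nrow(df) - m + 1, 1)
  df[seq(start, length.out = m), , drop = FALSE]
}

set.seed(42)  # arbitrary seed, only for reproducibility
block <- sample_block(iris, 10)

nrow(block)
#> [1] 10

# Row names of iris are 1, 2, ..., so consecutive rows differ by one
idx <- as.integer(rownames(block))
all(diff(idx) == 1)
#> [1] TRUE
```

Subtracting `m - 1` from the number of rows before sampling the start is what guarantees that the block never runs past the last row.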

**Q**: How could you put the columns in a data frame in alphabetical order?

**A**: We combine `order()` with `[`:

```r
iris[order(names(iris))]
```
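As a quick check, the sketch below compares the resulting column order with the expected alphabetical one. The `decreasing = TRUE` variant is an illustrative extension that is not part of the original answer.

```r
sorted <- iris[order(names(iris))]

identical(
  names(sorted),
  c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width", "Species")
)
#> [1] TRUE

# Reverse alphabetical order works the same way
reversed <- iris[order(names(iris), decreasing = TRUE)]
identical(names(reversed), rev(names(sorted)))
#> [1] TRUE
```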