# 4 Subsetting

## 4.1 Selecting multiple elements

**Q**: Fix each of the following common data frame subsetting errors:

**Q**: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)

**A**: `NA` has logical type, and logical subscripts are recycled to the length of the vector being subset. For a vector `x` of length five, `x[NA]` is therefore evaluated as `x[c(NA, NA, NA, NA, NA)]`. Subsetting an atomic vector with `NA` returns `NA`, and this happens five times, which is why five missing values are returned. `NA_real_`, in contrast, is a numeric index of length one and is not recycled, so `x[NA_real_]` returns a single `NA`.

**Q**: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

**A**: `upper.tri()` returns a logical matrix containing `TRUE` for all elements above the diagonal and `FALSE` otherwise (with `diag = TRUE` the diagonal itself is included). The implementation is straightforward but quite interesting, as it compares row and column indices to create the logical matrix: `.row(dim(x)) <= .col(dim(x))` when the diagonal is included and the same comparison with `<` otherwise. Subsetting with it behaves exactly like subsetting with any other logical matrix: all elements that correspond to `TRUE` are selected, in column-major order. We don’t need to treat this form of subsetting in a special way.

**Q**: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?

**A**: `mtcars[1:20]` is subset with a single vector, so the data frame is indexed like a list, and in general this statement would return a data frame of the first 20 columns. But `mtcars` has only 11 columns, so the index is out of bounds and an error is thrown. `mtcars[1:20, ]` is subset with two vectors (the second one empty), so the first 20 rows of all columns are returned.

**Q**: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).

**A**: The elements on the diagonal of a matrix have identical row and column indices. This characteristic can be used to create a suitable numeric matrix for subsetting.

```r
diag2 <- function(x) {
  n <- min(dim(x))
  indices <- seq_len(n)
  # Two-column matrix of (row, column) pairs: (1, 1), (2, 2), ...
  diag_matrix <- matrix(rep(indices, 2), ncol = 2)
  x[diag_matrix]
}

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25
diag2(x)
#> [1]  1  7 13 19 25
```
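The recycling argument from the `x[NA]` answer earlier in this section can be checked directly. The following is a minimal sketch; the vector `x <- 1:5` is an assumption, chosen only because it has the length five that the answer implies.

```r
# Assumption: a length-five vector, matching the five NAs discussed above
x <- 1:5

res_lgl <- x[NA]        # logical NA is recycled to length(x)
res_dbl <- x[NA_real_]  # numeric NA is a single index and is not recycled

length(res_lgl)
#> [1] 5
length(res_dbl)
#> [1] 1

# Same result as spelling out the recycled logical index
identical(x[NA], x[c(NA, NA, NA, NA, NA)])
#> [1] TRUE
```

The type of the index, not the fact that it is missing, decides how many values come back.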

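The claim from the `upper.tri()` answer above, that no special subsetting rule is needed, can be illustrated with a small matrix. This is only a sketch; the 3 x 3 matrix `m` is a hypothetical example.

```r
# Hypothetical example matrix, filled column by column
m <- matrix(1:9, nrow = 3)

# Elements strictly above the diagonal, returned in column-major order
upper_vals <- m[upper.tri(m)]
upper_vals
#> [1] 4 7 8

# Including the diagonal
upper_with_diag <- m[upper.tri(m, diag = TRUE)]
upper_with_diag
#> [1] 1 4 5 7 8 9

# The same logical matrix can be built by hand from row and column indices
identical(upper_vals, m[row(m) < col(m)])
#> [1] TRUE
```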
**Q**: What does `df[is.na(df)] <- 0` do? How does it work?

**A**: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
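A short sketch of this replacement in action may help. The data frame below is a hypothetical example, not taken from the exercise.

```r
# Hypothetical example data with missing values in both columns
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# is.na() on a data frame returns a logical matrix of the same shape
is.matrix(is.na(df))
#> [1] TRUE

# Subassignment with the logical matrix only touches the NA positions
df[is.na(df)] <- 0
df
#>   a b
#> 1 1 0
#> 2 0 5
#> 3 3 6
```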

## 4.2 Selecting a single element

**Q**: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.

**A**: Base R already provides an abundance of possibilities:

```r
# using [[3]] instead of [3] would also work in these examples
mtcars$cyl[3]
#> [1] 4
mtcars[ , "cyl"][3]
#> [1] 4
mtcars[["cyl"]][3]
#> [1] 4
with(mtcars, cyl[3])
#> [1] 4

mtcars[3, 2]
#> [1] 4
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4
with(mtcars[3, ], cyl)
#> [1] 4

tail(head(mtcars, 3), 1)$cyl
#> [1] 4
head(tail(mtcars, 30), 1)$cyl  # not very practical ;)
#> [1] 4
subset(mtcars, rownames(mtcars) == "Datsun 710")$cyl
#> [1] 4
```

When we turn to other libraries, e.g. the tidyverse packages, even more possibilities open up.
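One possible tidyverse-style variant is sketched below. It assumes that the `dplyr` package is installed (the code is skipped otherwise), and the particular functions chosen here are just one illustration, not necessarily the approach the authors had in mind.

```r
# Assumption: the dplyr package is installed; the block is skipped otherwise
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Pull the column out as a vector, then take its third element
  via_pull <- dplyr::nth(dplyr::pull(mtcars, cyl), 3)

  # Or keep only the third row first, then pull the column
  via_slice <- dplyr::pull(dplyr::slice(mtcars, 3), cyl)

  # Both agree with the base R result mtcars$cyl[3]
  stopifnot(via_pull == mtcars$cyl[3], via_slice == mtcars$cyl[3])
}
```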

**Q**: Given a linear model, e.g., `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).

**A**: `mod` has the type list, which opens up several possibilities:

```r
mod <- lm(mpg ~ wt, data = mtcars)

mod$df.residual       # output preserved
#> [1] 30
mod$df.res            # `$` allows partial matching
#> [1] 30
mod["df.residual"]    # list output
#> $df.residual
#> [1] 30
mod[["df.residual"]]  # output preserved
#> [1] 30
```

The same also applies to `summary(mod)`, so the R squared can be extracted in the same way, e.g. with `summary(mod)$r.squared`.

(Tip: The `broom` package provides a very useful approach to working with models in a tidy way.)
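The extraction from the model summary can be sketched as follows. The variable names `mod_summary` and `r2` are introduced here only for illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)
mod_summary <- summary(mod)  # also a list, so the same operators apply

# Residual degrees of freedom: 32 observations minus 2 estimated coefficients
mod$df.residual
#> [1] 30

# R squared is stored as the `r.squared` element of the summary
r2 <- mod_summary$r.squared
identical(r2, mod_summary[["r.squared"]])
#> [1] TRUE

# Sanity check: in a simple regression, R squared equals the squared correlation
all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2)
#> [1] TRUE
```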

## 4.3 Applications

**Q**: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

**A**: This can be achieved by combining `[` and `sample()`. Passing sampled column indices as the column subscript permutes the columns, and supplying sampled row indices as the row subscript in the same call permutes rows and columns at once.

**Q**: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

**A**: Selecting `m` random rows from a data frame can be achieved through subsetting with sampled row indices. Keeping consecutive rows together as a “blocked sample” requires only some caution to get the start and end index correct.
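For the column-permutation question above, a minimal sketch could look like this. Using `mtcars` as the example data and fixing a seed are assumptions made here for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Permute the columns: a single index vector selects columns
cols_permuted <- mtcars[sample(ncol(mtcars))]

# Permute rows and columns in one step
both_permuted <- mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]

# Dimensions and names are unchanged, only the order differs
dim(both_permuted)
#> [1] 32 11
setequal(names(both_permuted), names(mtcars))
#> [1] TRUE
```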

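The two sampling strategies from the answer above can be sketched as follows. The sample size `m <- 10` and the use of `mtcars` are hypothetical choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
m <- 10      # hypothetical sample size

# m random rows
random_rows <- mtcars[sample(nrow(mtcars), m), ]

# Contiguous block of m rows: the latest possible start is nrow - m + 1,
# otherwise the block would run past the last row
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1
block_rows <- mtcars[start:end, , drop = FALSE]

nrow(random_rows)
#> [1] 10
nrow(block_rows)
#> [1] 10
```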
**Q**: How could you put the columns in a data frame in alphabetical order?

**A**: We first sort the column names alphabetically and use this vector to subset the data frame.
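A minimal sketch of this approach, again assuming `mtcars` as the example data:

```r
# Sort the column names, then use them as a character index
mtcars_sorted <- mtcars[sort(names(mtcars))]

# The column names are now in alphabetical order
!is.unsorted(names(mtcars_sorted))
#> [1] TRUE

# Indexing by position via order() gives the same result
identical(mtcars_sorted, mtcars[order(names(mtcars))])
#> [1] TRUE
```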