# 3 Subsetting

## 3.1 Selecting multiple elements

**Q**: Fix each of the following common data frame subsetting errors:**Q**: Why does the following code yield five missing values? (Hint: why is it different from`x[NA_real_]`

?)**A**:`NA`

is of class logical, so`x[NA]`

becomes recycled to`x[NA, NA, NA, NA, NA]`

. Since subsetting an atomic with`NA`

leads to an`NA`

, you will get 5 of them returned.**Q**: What does`upper.tri()`

return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?**A**:`upper.tri()`

has really intuitive source code. It coerces its input to a matrix and returns a logical matrix. Hence describing its behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.**Q**: Why does`mtcars[1:20]`

return an error? How does it differ from the similar`mtcars[1:20, ]`

?**A**: In the first case`mtcar`

is subsetted with a vector and the statement should return a data.frame of the first 20 columns in`mtcars`

. Since`mtcars`

only has 11 columns, the index is out of bounds, which explains the error. The biggest difference of`mtcars[1:20, ]`

to the former case, is that now`mtcars`

is subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).**Q**: Implement your own function that extracts the diagonal entries from a matrix (it should behave like`diag(x)`

where`x`

is a matrix).**A**: First we copy the relevant part of the source code from the`diag()`

function:`function (x = 1, nrow, ncol){ if (is.matrix(x)) { if (nargs() > 1L) # this and the next line will be dropped stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix") if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

In the next step we drop the unncessary

`nrow`

and`ncol`

argument and the related code in the 3rd and 4th line:`diag_v <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function

`diag_v()`

. Of course we can implement our own function`diag_m()`

, where we subset with a matrix.`diag_m <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large

`diamonds`

dataset from the`ggplot2`

package into a matrix.`diamonds_m <- as.matrix(ggplot2::diamonds) stopifnot(all.equal(diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max #> diag(diamonds_m) 4.684 5.3045 6.59819 5.5620 5.9810 63.463 #> diag_v(diamonds_m) 12.118 12.4920 13.47695 12.8195 13.5435 24.216 #> diag_m(diamonds_m) 13.699 14.3170 120.94208 14.6885 16.1765 10523.849 #> neval #> 100 #> 100 #> 100`

The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues

`diag_c <- compiler::cmpfun(diag) diag_v_c <- compiler::cmpfun(diag_v) diag_m_c <- compiler::cmpfun(diag_m) # Now we can make a fair comparison of the speed: microbenchmark::microbenchmark( diag_c(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag_c(diamonds_m) 4.767 5.3015 6.32044 5.5495 5.937 22.950 100 #> diag_v_c(diamonds_m) 12.184 12.6275 15.70798 13.2845 14.766 34.008 100 #> diag_m_c(diamonds_m) 14.010 14.5005 19.17780 14.9925 18.480 115.717 100`

We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.

We could also take an idea from the source code of

`upper.tri()`

and subset with a logical vector (but it turns out to be really slow):`diag_lv <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[row(x) == col(x)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

compile it and compare it with the other versions

`diag_lv_c <- compiler::cmpfun(diag_lv) stopifnot(all.equal(diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq #> diag(diamonds_m) 4.699 6.2265 12.14297 10.6135 16.3455 #> diag_v_c(diamonds_m) 12.325 14.8560 20.93825 19.4540 25.6610 #> diag_m_c(diamonds_m) 14.217 17.8720 26.91302 26.5535 33.4090 #> diag_lv_c(diamonds_m) 2228.359 2346.3610 3681.55335 3044.9975 5030.9455 #> max neval #> 55.894 100 #> 45.841 100 #> 68.303 100 #> 8055.300 100`

**Q**: What does`df[is.na(df)] <- 0`

do? How does it work?**A**: It replaces all`NA`

s within`df`

with the value`0`

.`is.na(df)`

returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part of`df`

(the`NA`

s) is replaced with`0`

entries.

## 3.2 Selecting a single element

**Q**: Brainstorm as many ways as possible to extract the third value from the`cyl`

variable in the`mtcars`

dataset.**A**: Lets first look at the usual base R options:`mtcars$cyl[3] #> [1] 4 mtcars$cyl[[3]] #> [1] 4 mtcars[ , "cyl"][3] #> [1] 4 mtcars[ , "cyl"][[3]] #> [1] 4 mtcars[["cyl"]][3] #> [1] 4 mtcars[["cyl"]][[3]] #> [1] 4 with(mtcars, cyl[3]) #> [1] 4 with(mtcars, cyl[[3]]) #> [1] 4 mtcars[3, 2] #> [1] 4 mtcars[3, ]$cyl #> [1] 4 mtcars[3, "cyl"] #> [1] 4 mtcars[3, ][ , "cyl"] #> [1] 4 mtcars[3, ][["cyl"]] #> [1] 4 with(mtcars[3, ], cyl) #> [1] 4`

Of course we could also use addon packages like the tidyverse for the subsetting of the

`cyl`

variable and its third element:Many more packages, which we are currently not aware of might provide similar functionality.

**Q**: Given a linear model, e.g.,`mod <- lm(mpg ~ wt, data = mtcars)`

, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`

).**A**: Since`mod`

is of type list we can expect several possibilities:`mod$df.residual # preserving output mod$df.r # preserving output with partial matching mod["df.residual"] # list output (without partial matching) mod[["df.residual"]] # preserving output (without partial matching)`

The same states for

`summary(mod)`

, so we can use for example:(To get tidy output from r-models in general also the

`broom`

package is a good alternative).

## 3.3 Applications

**Q**: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?**A**: Combine``[``

with the`sample()`

function:**Q**: How would you select a random sample of`m`

rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?**A**: For example**Q**: How could you put the columns in a data frame in alphabetical order?**A**: We can sort the column names first and then use the result to subset the data frame for all columns: