# 3 Subsetting

## 3.1 Selecting multiple elements

1. Q: Fix each of the following common data frame subsetting errors:

mtcars[mtcars$cyl = 4, ] # = -> == mtcars[-1:4, ] # -1:4 -> -(1:4) mtcars[mtcars$cyl <= 5]        # "," is missing
mtcars[mtcars$cyl == 4 | 6, ] # 6 -> mtcars$cyl == 6
2. Q: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

x <- 1:5
x[NA]
#> [1] NA NA NA NA NA

A: NA is of class logical, so x[NA] becomes recycled to x[NA, NA, NA, NA, NA]. Since subsetting an atomic with NA leads to an NA, you will get 5 of them returned.

3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]

A: upper.tri() has really intuitive source code. It coerces its input to a matrix and returns a logical matrix. Hence describing its behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.

4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

A: In the first case mtcar is subsetted with a vector and the statement should return a data.frame of the first 20 columns in mtcars. Since mtcars only has 11 columns, the index is out of bounds, which explains the error. The biggest difference of mtcars[1:20, ] to the former case, is that now mtcars is subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).

5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

A: First we copy the relevant part of the source code from the diag() function:

function (x = 1, nrow, ncol){
if (is.matrix(x)) {
if (nargs() > 1L) # this and the next line will be dropped
stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix")
if ((m <- min(dim(x))) == 0L)
return(vector(typeof(x), 0L))
y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)]
nms <- dimnames(x)
if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
nms[[2L]][seq_len(m)]))
names(y) <- nm
return(y)
}
}

In the next step we drop the unncessary nrow and ncol argument and the related code in the 3rd and 4th line:

diag_v <- function (x = 1) {
if (is.matrix(x)) {
if ((m <- min(dim(x))) == 0L)
return(vector(typeof(x), 0L))
y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector
nms <- dimnames(x)
if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
nms[[2L]][seq_len(m)]))
names(y) <- nm
return(y)
}
}

If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function diag_v(). Of course we can implement our own function diag_m(), where we subset with a matrix.

diag_m <- function (x = 1) {
if (is.matrix(x)) {
if ((m <- min(dim(x))) == 0L)
return(vector(typeof(x), 0L))
y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix
nms <- dimnames(x)
if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
nms[[2L]][seq_len(m)]))
names(y) <- nm
return(y)
}
}

Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large diamonds dataset from the ggplot2 package into a matrix.

diamonds_m <- as.matrix(ggplot2::diamonds)

stopifnot(all.equal(diag(diamonds_m),
diag_v(diamonds_m),
diag_m(diamonds_m))) # our functions succeed the little test

microbenchmark::microbenchmark(
diag(diamonds_m),
diag_v(diamonds_m),
diag_m(diamonds_m)
)
#> Unit: microseconds
#>                expr    min      lq      mean  median      uq       max
#>    diag(diamonds_m)  4.684  5.3045   6.59819  5.5620  5.9810    63.463
#>  diag_v(diamonds_m) 12.118 12.4920  13.47695 12.8195 13.5435    24.216
#>  diag_m(diamonds_m) 13.699 14.3170 120.94208 14.6885 16.1765 10523.849
#>  neval
#>    100
#>    100
#>    100

The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues

diag_c <- compiler::cmpfun(diag)
diag_v_c <- compiler::cmpfun(diag_v)
diag_m_c <- compiler::cmpfun(diag_m)

# Now we can make a fair comparison of the speed:

microbenchmark::microbenchmark(
diag_c(diamonds_m),
diag_v_c(diamonds_m),
diag_m_c(diamonds_m)
)
#> Unit: microseconds
#>                  expr    min      lq     mean  median     uq     max neval
#>    diag_c(diamonds_m)  4.767  5.3015  6.32044  5.5495  5.937  22.950   100
#>  diag_v_c(diamonds_m) 12.184 12.6275 15.70798 13.2845 14.766  34.008   100
#>  diag_m_c(diamonds_m) 14.010 14.5005 19.17780 14.9925 18.480 115.717   100

We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.

We could also take an idea from the source code of upper.tri() and subset with a logical vector (but it turns out to be really slow):

diag_lv <- function (x = 1) {
if (is.matrix(x)) {
if ((m <- min(dim(x))) == 0L)
return(vector(typeof(x), 0L))
y <- x[row(x) == col(x)]
nms <- dimnames(x)
if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
nms[[2L]][seq_len(m)]))
names(y) <- nm
return(y)
}
}

compile it and compare it with the other versions

diag_lv_c <- compiler::cmpfun(diag_lv)
stopifnot(all.equal(diag(diamonds_m),
diag_v_c(diamonds_m),
diag_m_c(diamonds_m),
diag_lv_c(diamonds_m))) # our functions succeed the little test

microbenchmark::microbenchmark(
diag(diamonds_m),
diag_v_c(diamonds_m),
diag_m_c(diamonds_m),
diag_lv_c(diamonds_m)
)
#> Unit: microseconds
#>                   expr      min        lq       mean    median        uq
#>       diag(diamonds_m)    4.699    6.2265   12.14297   10.6135   16.3455
#>   diag_v_c(diamonds_m)   12.325   14.8560   20.93825   19.4540   25.6610
#>   diag_m_c(diamonds_m)   14.217   17.8720   26.91302   26.5535   33.4090
#>  diag_lv_c(diamonds_m) 2228.359 2346.3610 3681.55335 3044.9975 5030.9455
#>       max neval
#>    55.894   100
#>    45.841   100
#>    68.303   100
#>  8055.300   100
6. Q: What does df[is.na(df)] <- 0 do? How does it work?

A: It replaces all NAs within df with the value 0. is.na(df) returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part of df (the NAs) is replaced with 0 entries.

## 3.2 Selecting a single element

1. Q: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

A: Lets first look at the usual base R options:

mtcars$cyl[3] #> [1] 4 mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][3]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][3]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[3])
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4

mtcars[3, 2]
#> [1] 4
mtcars[3, ]$cyl #> [1] 4 mtcars[3, "cyl"] #> [1] 4 mtcars[3, ][ , "cyl"] #> [1] 4 mtcars[3, ][["cyl"]] #> [1] 4 with(mtcars[3, ], cyl) #> [1] 4 Of course we could also use addon packages like the tidyverse for the subsetting of the cyl variable and its third element: library(magrittr) mtcars %>% dplyr::pull(cyl) %>% purrr::pluck(3) #> [1] 4 Many more packages, which we are currently not aware of might provide similar functionality. 2. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)). A: Since mod is of type list we can expect several possibilities: mod$df.residual       # preserving output
mod$df.r # preserving output with partial matching mod["df.residual"] # list output (without partial matching) mod[["df.residual"]] # preserving output (without partial matching) The same states for summary(mod), so we can use for example: summary(mod)$r.squared

(To get tidy output from r-models in general also the broom package is a good alternative).

## 3.3 Applications

1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

A: Combine [ with the sample() function:

# Permute rows
iris[sample(ncol(iris))]

# Permute rows and columns in one step
iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

A: For example

m=10
iris[sample(nrow(iris), m),]

# Blockversion
start <- sample(nrow(iris) - m + 1, 1)
end <- start + m - 1
iris[start:end, , drop = FALSE]
3. Q: How could you put the columns in a data frame in alphabetical order?

A: We can sort the column names first and then use the result to subset the data frame for all columns:

iris[sort(names(iris))]