23 Subsetting

23.1 Data types

  1. Q: Fix each of the following common data frame subsetting errors:
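    The erroneous snippets themselves are not shown above, so the following is only an illustrative sketch of typical mistakes of this kind (wrong operator, mixed index signs, a missing comma, a malformed logical condition), each with one possible fix. The names fix1 to fix4 are hypothetical.

```r
# mtcars[mtcars$cyl = 4, ]       # `=` is not a comparison; use `==`
fix1 <- mtcars[mtcars$cyl == 4, ]

# mtcars[-1:4, ]                 # -1:4 mixes negative and positive indices
fix2 <- mtcars[-(1:4), ]

# mtcars[mtcars$cyl <= 5]        # missing comma: the logical vector selects columns
fix3 <- mtcars[mtcars$cyl <= 5, ]

# mtcars[mtcars$cyl == 4 | 6, ]  # `| 6` makes the condition TRUE for every row
fix4 <- mtcars[mtcars$cyl %in% c(4, 6), ]
```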

  2. Q: Why does x <- 1:5; x[NA] yield five missing values? (Hint: why is it different from x[NA_real_]?)
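    A: NA is of type logical, and logical indices are recycled to the length of x, so x[NA] is equivalent to x[c(NA, NA, NA, NA, NA)]. Since subsetting an atomic vector with NA yields NA, five of them are returned. In contrast, NA_real_ is a numeric index of length one and is not recycled, so x[NA_real_] returns a single NA.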
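    A short check of the difference between a logical and a numeric NA index (the variable names are only for illustration):

```r
x <- 1:5

na_type <- typeof(NA)     # "logical"
by_lgl  <- x[NA]          # logical index, recycled to the length of x
by_num  <- x[NA_real_]    # numeric index of length one, not recycled

length(by_lgl)
length(by_num)
```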

  3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    A: upper.tri() has really intuitive source code. It coerces its input to a matrix and returns a logical matrix of the same dimensions. Hence its behaviour in subsetting follows from everything that applies to subsetting with logical matrices.
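    A small example on a 3 x 3 matrix (chosen here purely for illustration) shows the logical matrix and the result of subsetting with it. The selected elements come back as a vector in column-major order.

```r
m <- matrix(1:9, nrow = 3)

ut <- upper.tri(m)   # logical matrix, TRUE strictly above the diagonal
ut

upper <- m[ut]       # elements where ut is TRUE, column by column
upper
```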

  4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
    A: In the first case mtcars is subset with a single vector, so it is treated like a list and the statement should return a data frame of the first 20 columns. Since mtcars only has 11 columns, the index is out of bounds, which explains the error. In mtcars[1:20, ], by contrast, mtcars is subset with two indices (the second one empty), so you get the first 20 rows and all columns, just like subsetting a matrix.
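    The two cases can be compared directly. In this sketch the error is caught with tryCatch() so that the script keeps running; the variable names are only illustrative.

```r
# Single index: selects columns, and columns 12 to 20 do not exist
col_result <- tryCatch(
  mtcars[1:20],
  error = function(e) conditionMessage(e)
)
col_result

# Two indices: first 20 rows, all columns
row_result <- mtcars[1:20, ]
dim(row_result)
```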

  5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).
    A: First we copy the relevant part of the source code from the diag() function:

    In the next step we drop the unnecessary nrow and ncol arguments and the related code in the 3rd and 4th lines:

    Looking at how the diagonal elements are captured, we can see that the input matrix is subset with a vector of positions, so we call this function diag_v(). Of course we can also implement our own function diag_m(), where we subset with a matrix of row and column indices.
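    A minimal sketch of both variants might look as follows. It is not a copy of the base R source: the is.matrix() check and the handling of empty matrices are assumptions of this sketch, and names are not preserved.

```r
# Subset with a vector of positions: in a column-major matrix with n rows,
# the i-th diagonal element sits at position 1 + (i - 1) * (n + 1).
diag_v <- function(x) {
  stopifnot(is.matrix(x))
  n <- nrow(x)
  m <- min(dim(x))
  if (m == 0L) return(x[0L])
  x[1 + 0L:(m - 1L) * (n + 1)]
}

# Subset with a two-column matrix of (row, column) pairs.
diag_m <- function(x) {
  stopifnot(is.matrix(x))
  i <- seq_len(min(dim(x)))
  x[cbind(i, i)]
}

mat <- matrix(1:12, nrow = 3)
diag_v(mat)
diag_m(mat)
```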

    Now we can check whether we get the same results as the original function and also compare the speed. To do so, we convert the relatively large diamonds dataset from the ggplot2 package into a matrix.

    The original function seems to be a little faster than both the trimmed version and our matrix version. Perhaps this is because the base function is byte-compiled.

    We can see that our diag_m() version is only slightly slower than the original. However, the source code of the matrix version is arguably a bit easier to read.

    We could also take an idea from the source code of upper.tri() and subset with a logical vector (but it turns out to be really slow):
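    One way to write such a logical version is sketched below. The name diag_lv() is hypothetical, and row(x) == col(x) is used here as a readable stand-in for the comparison of row and column indices that upper.tri() relies on.

```r
# Subset with a logical matrix that is TRUE exactly on the diagonal.
diag_lv <- function(x) {
  stopifnot(is.matrix(x))
  x[row(x) == col(x)]
}

mat <- matrix(1:12, nrow = 3)
diag_lv(mat)
```

    The logical matrix has as many elements as x itself, which is one plausible reason for the slowdown on large inputs.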

    We then compile it and compare it with the other versions.

  6. Q: What does df[is.na(df)] <- 0 do? How does it work?
    A: It replaces all NAs within df with the value 0. is.na(df) returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part of df (the NAs) is replaced with 0 entries.
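    The mechanism can be traced on a small data frame (made up for this illustration):

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

na_mask <- is.na(df)   # logical matrix with the same shape as df
na_mask

df[na_mask] <- 0       # subassignment touches only the TRUE positions
df
```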

23.2 Subsetting operators

  1. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).
    A: Since mod is of type list, there are several ways to extract the value:
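    For instance, the following all return the residual degrees of freedom. The list element name df.residual and the extractor function df.residual() are part of base R's lm interface; the variable names on the left are only illustrative.

```r
mod <- lm(mpg ~ wt, data = mtcars)

df_dollar   <- mod$df.residual         # partial-matching `$`
df_brackets <- mod[["df.residual"]]    # exact matching `[[`
df_accessor <- df.residual(mod)        # dedicated extractor function

df_dollar
```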

    The same holds for summary(mod), so we can use for example:
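    A short sketch, assuming the model from the question; r.squared is the name of the corresponding element in the summary.lm object.

```r
mod <- lm(mpg ~ wt, data = mtcars)
mod_summary <- summary(mod)

r2_dollar   <- mod_summary$r.squared
r2_brackets <- mod_summary[["r.squared"]]

r2_dollar
```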

    (For tidy output from R models in general, the broom package is also a good alternative.)

23.3 Applications

  1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
    A: Combine `[` with the sample() function:
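    A minimal sketch using mtcars as the example data frame (the seed is set only to make the illustration reproducible):

```r
set.seed(42)
df <- mtcars

# Permute the columns only
perm_cols <- df[sample(ncol(df))]

# Permute rows and columns in one step
perm_both <- df[sample(nrow(df)), sample(ncol(df))]

names(perm_cols)
```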

  2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
    A: For example:
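    One possible approach, with m = 10 and mtcars chosen only for illustration. For the contiguous case, a random start row is drawn so that the block of m rows still fits inside the data frame.

```r
set.seed(42)
m <- 10

# m rows chosen at random, without replacement
random_rows <- mtcars[sample(nrow(mtcars), m), ]

# m contiguous rows: draw the first row, then take the following m - 1 rows
start <- sample(nrow(mtcars) - m + 1, 1)
contiguous_rows <- mtcars[start:(start + m - 1), , drop = FALSE]

rownames(contiguous_rows)
```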

  3. Q: How could you put the columns in a data frame in alphabetical order?
    A: We can sort the names and subset by name:
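    As a sketch, again with mtcars as the example; using order() on the names gives the same result via integer positions.

```r
by_name  <- mtcars[sort(names(mtcars))]
by_order <- mtcars[order(names(mtcars))]

names(by_name)
```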