4 Subsetting

4.1 Selecting multiple elements

  1. Q: Fix each of the following common data frame subsetting errors:
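
    The exercise's snippets are not shown above. As a hedged illustration only, the comments below assume four typical mistakes of this kind (they are assumptions, not necessarily the original exercise code), each paired with a corrected call on the built-in mtcars data.

```r
# Assumed faulty calls, kept as comments because they fail or mislead:
# mtcars[mtcars$cyl = 4, ]       # `=` used instead of `==`
# mtcars[-1:4, ]                 # -1:4 mixes negative and positive indices
# mtcars[mtcars$cyl <= 5]        # missing comma: the logical vector is applied to columns
# mtcars[mtcars$cyl == 4 | 6, ]  # `| 6` is always TRUE, so every row is kept

fix1 <- mtcars[mtcars$cyl == 4, ]          # compare with `==`
fix2 <- mtcars[-(1:4), ]                   # negate the whole sequence
fix3 <- mtcars[mtcars$cyl <= 5, ]          # add the comma to select rows
fix4 <- mtcars[mtcars$cyl %in% c(4, 6), ]  # test membership in a set
```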

  2. Q: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

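    A: NA has logical type, so as a logical index it is recycled to the length of x: x[NA] is equivalent to x[c(NA, NA, NA, NA, NA)]. Subsetting an atomic vector with NA returns NA, and here this happens five times, which is why five missing values are returned. In contrast, NA_real_ is a double, so x[NA_real_] is treated as a single (missing) position and is not recycled; it returns one NA.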
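
    The difference can be checked directly. The sketch below assumes x is a vector of length five (x <- 1:5 is an assumed example, since the exercise code is not shown here).

```r
x <- 1:5           # assumed example vector of length five

res_lgl <- x[NA]        # logical NA is recycled to length(x)
res_dbl <- x[NA_real_]  # numeric NA is a single index

length(res_lgl)    # five elements, all NA
length(res_dbl)    # one element, NA
typeof(NA)         # "logical"
typeof(NA_real_)   # "double"
```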

  3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    A: upper.tri() returns a logical matrix with the same dimensions as its input, containing TRUE for all elements above the diagonal and FALSE otherwise. The implementation of upper.tri() is straightforward but quite interesting: it compares row and column indices, using .row(dim(x)) < .col(dim(x)) by default and .row(dim(x)) <= .col(dim(x)) when diag = TRUE, to create the logical matrix. Subsetting with it behaves like subsetting with any other logical matrix: all elements that correspond to TRUE are selected and returned as a vector in column-major order. We don’t need to treat this form of subsetting in a special way.
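
    A small sketch of this behaviour, using a hypothetical 3 x 3 matrix m:

```r
m <- matrix(1:9, nrow = 3)   # hypothetical example matrix, filled by column

upper.tri(m)                 # logical matrix, TRUE strictly above the diagonal

above <- m[upper.tri(m)]                    # elements above the diagonal
with_diag <- m[upper.tri(m, diag = TRUE)]   # diagonal included

above       # 4 7 8 (column-major order)
with_diag   # 1 4 5 7 8 9
```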

  4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

    A: mtcars[1:20] uses a single index vector, so the data frame is subsetted like a list and the statement would in general return a data frame with the first 20 columns. But mtcars has only 11 columns, so the index is out of bounds and an error is thrown ("undefined columns selected"). mtcars[1:20, ] supplies a row index and an empty column index, so the first 20 rows of all columns are returned.
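
    To see both cases side by side, the following sketch catches the error with tryCatch() instead of letting it stop the script:

```r
dim(mtcars)   # 32 rows, 11 columns

# One index: treated as a column selection, so 1:20 is out of bounds.
msg <- tryCatch(
  mtcars[1:20],
  error = function(e) conditionMessage(e)
)
msg

# Two indices: rows 1 to 20, all columns.
first20 <- mtcars[1:20, ]
dim(first20)
```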

  5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

    A: The elements in the diagonal of a matrix have the same row and column indices. We can use this property to build a two-column numeric matrix of (row, column) index pairs and subset with it.
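
    One possible implementation along these lines is sketched below. The name diag2 is hypothetical, and the sketch ignores dimnames, which diag() may use to name its result.

```r
diag2 <- function(x) {
  stopifnot(is.matrix(x))

  # The diagonal is as long as the shorter dimension.
  n <- min(nrow(x), ncol(x))

  # Two-column index matrix: row i of idx selects x[i, i].
  idx <- cbind(seq_len(n), seq_len(n))
  x[idx]
}

x <- matrix(1:30, nrow = 5)   # hypothetical non-square example
diag2(x)
diag(x)
```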

  6. Q: What does df[is.na(df)] <- 0 do? How does it work?

    A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix that encodes the positions of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
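
    A minimal sketch with a hypothetical data frame shows both steps:

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))   # hypothetical data

na_pos <- is.na(df)   # logical matrix with the same shape as df
na_pos

df[is.na(df)] <- 0    # assign 0 only where the matrix is TRUE
df
```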

4.2 Selecting a single element

  1. Q: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

    A: Base R already provides an abundance of possibilities:
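
    A few of the base R options, as a non-exhaustive sketch (cyl is the second column of mtcars):

```r
# Extract the column first, then the element.
a <- mtcars$cyl[[3]]
b <- mtcars[["cyl"]][[3]]
c3 <- mtcars[[2]][[3]]

# Select row and column in one step.
d <- mtcars[3, "cyl"]
e <- mtcars[3, 2]
f <- mtcars[[3, "cyl"]]

# Evaluate the column name inside the data frame.
g <- with(mtcars, cyl[[3]])
```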

    When we turn to other libraries, e.g. the tidyverse packages, even more possibilities open up. As an example:
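
    For instance, assuming the dplyr package is installed, the value could be extracted as sketched below (the guard simply skips the example when dplyr is unavailable):

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Pull the column out as a vector, then take the third element.
  v1 <- dplyr::pull(mtcars, cyl)[[3]]

  # Keep only the third row, then pull the column.
  v2 <- dplyr::pull(dplyr::slice(mtcars, 3), cyl)
}
```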

  2. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).

    A: mod has the type list, which opens up several possibilities:
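
    Some of these possibilities, sketched with the model from the exercise:

```r
mod <- lm(mpg ~ wt, data = mtcars)

typeof(mod)                      # "list"

df1 <- mod$df.residual           # list-style access with $
df2 <- mod[["df.residual"]]      # list-style access with [[
df3 <- df.residual(mod)          # accessor function from stats
```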

    The same also applies to summary(mod), so we could use, e.g.:
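
    A sketch of the corresponding extraction from the summary object:

```r
mod <- lm(mpg ~ wt, data = mtcars)
s <- summary(mod)

r2_a <- s$r.squared
r2_b <- s[["r.squared"]]
r2_a
```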

    (Tip: The broom package provides a very useful approach to work with models in a tidy way.)

4.3 Applications

  1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?

    A: This can be achieved by combining `[` and sample():
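
    A minimal sketch on mtcars (set.seed() is only there to make the example reproducible):

```r
set.seed(1)

# Permute the columns only.
perm_cols <- mtcars[sample(ncol(mtcars))]

# Permute rows and columns in one step.
perm_both <- mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
```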

  2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

    A: Selecting m random rows from a data frame can be achieved through subsetting.
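
    For example, with m = 10 (an assumed value) and mtcars as the data frame:

```r
set.seed(1)
m <- 10   # assumed sample size

# Draw m distinct row positions and subset the rows with them.
rand_rows <- mtcars[sample(nrow(mtcars), m), ]
```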

    Keeping subsequent rows together as a “blocked sample” requires only some caution to get the start- and end-index correct.
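
    One way to handle the indices is sketched below: the start is drawn so that the block of m rows (m = 10 is an assumed value) always fits inside the data frame.

```r
set.seed(1)
m <- 10   # assumed block length

# The latest possible start is nrow - m + 1, so the block never runs off the end.
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1

block <- mtcars[start:end, , drop = FALSE]
```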

  3. Q: How could you put the columns in a data frame in alphabetical order?

    A: We first sort the column names alphabetically and use this vector to subset the data frame:
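
    Applied to mtcars, this could look as follows:

```r
# sort() orders the names; the resulting character vector selects the columns.
sorted <- mtcars[sort(names(mtcars))]
names(sorted)
```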