4 Subsetting

4.1 Selecting multiple elements

  1. Q: Fix each of the following common data frame subsetting errors:

    mtcars[mtcars$cyl = 4, ]       # use `==`              (instead of `=`)
    mtcars[-1:4, ]                 # use `-(1:4)`          (instead of `-1:4`)
    mtcars[mtcars$cyl <= 5]        # `,` is missing
    mtcars[mtcars$cyl == 4 | 6, ]  # use `mtcars$cyl == 6` (instead of `6`)
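
    The fixes named in the comments can be checked directly. The following is a minimal sketch of the corrected calls; the %in% variant is an alternative added here for illustration and is not part of the exercise.

    ```r
    # `=` is not a comparison operator and causes a syntax error here; use `==`
    four_cyl <- mtcars[mtcars$cyl == 4, ]

    # `-1:4` is (-1):4, which mixes negative and positive indices
    without_first4 <- mtcars[-(1:4), ]

    # The trailing comma selects rows; without it, the logical vector
    # would be used to index columns
    upto5 <- mtcars[mtcars$cyl <= 5, ]

    # `mtcars$cyl == 4 | 6` is always TRUE, because 6 is coerced to TRUE
    four_or_six <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

    # Equivalent alternative using %in%
    four_or_six2 <- mtcars[mtcars$cyl %in% c(4, 6), ]
    identical(four_or_six, four_or_six2)
    #> [1] TRUE
    ```
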
  2. Q: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

    x <- 1:5
    x[NA]
    #> [1] NA NA NA NA NA

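    A: In contrast to NA_real_, NA has logical type, and logical index vectors are recycled to the length of the vector being subset, i.e. x[NA] is recycled to x[c(NA, NA, NA, NA, NA)]. A missing value in a logical index yields a missing value in the output, hence the five NAs. NA_real_ is a numeric index of length one and is not recycled, so x[NA_real_] returns a single NA.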
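
    The difference between the two kinds of missing index can be seen in a short sketch, which only uses the x from the question:

    ```r
    x <- 1:5

    typeof(NA)        # logical index, recycled to length(x)
    #> [1] "logical"
    typeof(NA_real_)  # numeric index, used as is
    #> [1] "double"

    length(x[NA])
    #> [1] 5
    length(x[NA_real_])
    #> [1] 1

    # Spelling out the recycling gives the same result as x[NA]
    identical(x[NA], x[c(NA, NA, NA, NA, NA)])
    #> [1] TRUE
    ```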

  3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    x <- outer(1:5, 1:5, FUN = "*")
    x[upper.tri(x)]

    A: upper.tri() returns a logical matrix containing TRUE for all elements above the diagonal and FALSE otherwise. The implementation of upper.tri() is straightforward, but quite interesting: with the default diag = FALSE it compares row and column indices via .row(dim(x)) < .col(dim(x)) (and uses <= when diag = TRUE) to create the logical matrix. Subsetting with it works like subsetting with any logical vector: because a matrix is a vector with a dim attribute, all elements that correspond to TRUE are selected in column-major order, and the result is a plain vector without dimensions. We don’t need any additional subsetting rules.
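
    A smaller 3 x 3 matrix (chosen here only to keep the output short) makes the behaviour easy to follow. The sketch also compares the result with a logical matrix built from row() and col():

    ```r
    x <- outer(1:3, 1:3, FUN = "*")

    idx <- upper.tri(x)
    idx
    #>       [,1]  [,2]  [,3]
    #> [1,] FALSE  TRUE  TRUE
    #> [2,] FALSE FALSE  TRUE
    #> [3,] FALSE FALSE FALSE

    # The result is a plain vector in column-major order:
    # x[1, 2], x[1, 3], x[2, 3]
    x[idx]
    #> [1] 2 3 6

    # Same as comparing row and column indices directly
    identical(idx, row(x) < col(x))
    #> [1] TRUE
    ```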

  4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

    A: When a data frame is subset with a single vector, it behaves like a list of columns, so mtcars[1:20] asks for the first 20 columns of the dataset. But mtcars has only 11 columns, so the index is out of bounds and an error is thrown. mtcars[1:20, ] supplies two indices, so 2d subsetting kicks in: the first index refers to rows, and the empty second index keeps all columns.
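
    Both cases can be checked without aborting the session by catching the error with tryCatch(), as in this small sketch:

    ```r
    ncol(mtcars)
    #> [1] 11

    # One index: treated like a list of columns, so 1:20 is out of bounds
    res <- tryCatch(mtcars[1:20], error = function(e) e)
    inherits(res, "error")
    #> [1] TRUE

    # Two indices: rows first, all columns kept
    dim(mtcars[1:20, ])
    #> [1] 20 11
    ```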

  5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

    A: The elements on the diagonal of a matrix have identical row and column indices. This property can be used to build a two-column numeric matrix of (row, column) pairs, which is then used for subsetting.

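    diag2 <- function(x) {
      # The diagonal is only as long as the shorter dimension
      n <- min(nrow(x), ncol(x))
      # Each row of idx is one (row, column) pair: (1, 1), (2, 2), ...
      idx <- cbind(seq_len(n), seq_len(n))

      x[idx]
    }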
    
    # Let's check if it works
    (x <- matrix(1:30, 5))
    #>      [,1] [,2] [,3] [,4] [,5] [,6]
    #> [1,]    1    6   11   16   21   26
    #> [2,]    2    7   12   17   22   27
    #> [3,]    3    8   13   18   23   28
    #> [4,]    4    9   14   19   24   29
    #> [5,]    5   10   15   20   25   30
    
    diag(x)
    #> [1]  1  7 13 19 25
    diag2(x)
    #> [1]  1  7 13 19 25

  6. Q: What does df[is.na(df)] <- 0 do? How does it work?

    A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix that encodes the position of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
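
    A small sketch illustrates the mechanism. The data frame df below is a hypothetical toy example and is not part of the exercise:

    ```r
    # Hypothetical toy data frame with one NA per column
    df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

    # is.na() on a data frame returns a logical matrix of the same shape
    is.matrix(is.na(df))
    #> [1] TRUE

    # Only the positions flagged TRUE are overwritten
    df[is.na(df)] <- 0
    df
    #>   a b
    #> 1 1 0
    #> 2 0 5
    #> 3 3 6
    ```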

4.2 Selecting a single element

  1. Q: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

    A: Base R already provides an abundance of possibilities:

    # Select column first
    mtcars$cyl[[3]]
    #> [1] 4
    mtcars[ , "cyl"][[3]]
    #> [1] 4
    mtcars[["cyl"]][[3]]
    #> [1] 4
    with(mtcars, cyl[[3]])
    #> [1] 4
    
    # Select row first
    mtcars[3, ]$cyl
    #> [1] 4
    mtcars[3, "cyl"]
    #> [1] 4
    mtcars[3, ][ , "cyl"]
    #> [1] 4
    mtcars[3, ][["cyl"]]
    #> [1] 4
    
    # Select simultaneously
    mtcars[3, 2]
    #> [1] 4
    mtcars[[c(2, 3)]]
    #> [1] 4

  2. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).

    A: mod has the type list, which opens up several possibilities:

    mod <- lm(mpg ~ wt, data = mtcars)
    
    mod$df.residual       # returns the element itself
    #> [1] 30
    mod$df.res            # `$` allows partial matching
    #> [1] 30
    mod["df.residual"]    # returns a list of length one
    #> $df.residual
    #> [1] 30
    mod[["df.residual"]]  # returns the element itself
    #> [1] 30

    The same applies to summary(mod), so we could use, e.g.:

    summary(mod)$r.squared
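
    As a side note on partial matching, `[[` matches names exactly by default. The following sketch shows how its exact argument changes that, and that `[[` works on the summary object too:

    ```r
    mod <- lm(mpg ~ wt, data = mtcars)

    # `[[` matches names exactly by default, so an abbreviated name gives NULL
    mod[["df.res"]]
    #> NULL

    # exact = FALSE switches on partial matching for `[[`
    mod[["df.res", exact = FALSE]]
    #> [1] 30

    # The summary is a list as well, so `[[` works like `$`
    r2 <- summary(mod)[["r.squared"]]
    identical(r2, summary(mod)$r.squared)
    #> [1] TRUE
    ```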

    (Tip: The broom package provides a very useful approach to working with models in a tidy way.)

4.3 Applications

  1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?

    A: This can be achieved by combining `[` and sample():

    # Permute columns
    iris[sample(ncol(iris))]
    
    # Permute columns and rows in one step
    iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
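
    To see that nothing is lost in the permutation, the sketch below shuffles iris and then restores the original order from the row and column names. The seed is arbitrary and only makes the example reproducible.

    ```r
    set.seed(1)  # arbitrary seed, for reproducibility only

    permuted <- iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]

    # Same shape and same column names, just in a different order
    identical(dim(permuted), dim(iris))
    setequal(names(permuted), names(iris))

    # Row names travel with the rows, so the original order can be restored
    restored <- permuted[order(as.integer(rownames(permuted))), names(iris)]

    # Compare the values only, ignoring attributes such as row names
    all.equal(restored, iris, check.attributes = FALSE)
    ```
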
  2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

    A: Selecting m random rows from a data frame can be achieved through subsetting.

    m <- 10
    iris[sample(nrow(iris), m), , drop = FALSE]

    Keeping consecutive rows together as a “blocked sample” only requires some care to get the start and end indices right: the start index may be at most nrow(iris) - m + 1, so that m rows still fit.

    start <- sample(nrow(iris) - m + 1, 1)
    end <- start + m - 1
    iris[start:end, , drop = FALSE]
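
    The same idea can be wrapped in a small function. The name sample_block() and its arguments are hypothetical and only serve as a sketch of the approach described above.

    ```r
    # Hypothetical helper, not part of the original solution
    sample_block <- function(df, m) {
      stopifnot(m >= 1, m <= nrow(df))
      # Latest possible start that still leaves room for m rows
      start <- sample(nrow(df) - m + 1, 1)
      df[seq(start, length.out = m), , drop = FALSE]
    }

    block <- sample_block(iris, 10)

    # In a contiguous block, consecutive row names differ by exactly 1
    all(diff(as.integer(rownames(block))) == 1)
    ```
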
  3. Q: How could you put the columns in a data frame in alphabetical order?

    A: We combine order() with `[`:

    iris[order(names(iris))]
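
    A quick check of the resulting column order, as a minimal sketch. Note that the ordering of names with mixed case depends on the locale, so the result may differ for other data frames.

    ```r
    sorted <- iris[order(names(iris))]
    names(sorted)
    #> [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"  "Species"

    # The columns are reordered, not changed
    identical(sorted$Species, iris$Species)
    #> [1] TRUE
    ```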