3 Subsetting
3.1 Selecting multiple elements
Q1: Fix each of the following common data frame subsetting errors:
mtcars[mtcars$cyl = 4, ]
# use `==` (instead of `=`)

mtcars[-1:4, ]
# use `-(1:4)` (instead of `-1:4`)

mtcars[mtcars$cyl <= 5]
# `,` is missing

mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
# or `%in% c(4, 6)` (instead of `== 4 | 6`)
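For reference, the corrected calls can be sketched as follows. The variable names (`four_cyl`, `without_first_four`, and so on) are introduced here only for illustration. Note that `-1:4` fails because it expands to `c(-1, 0, 1, 2, 3, 4)`, which mixes negative and positive indices.

```r
# 1. Comparison needs `==`; `=` is parsed as a named argument
four_cyl <- mtcars[mtcars$cyl == 4, ]

# 2. Negate the whole sequence to drop rows 1 to 4
without_first_four <- mtcars[-(1:4), ]

# 3. The trailing comma makes this a row selection
at_most_five <- mtcars[mtcars$cyl <= 5, ]

# 4. Each side of `|` must be a complete comparison ...
four_or_six <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]
# ... or use `%in%` for set membership
four_or_six_in <- mtcars[mtcars$cyl %in% c(4, 6), ]
```

Both variants of the fourth fix select the same rows, so the choice between them is mostly a matter of readability.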
Q2: Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)
x <- 1:5
x[NA]
#> [1] NA NA NA NA NA
A: In contrast to NA_real_, NA has logical type, and logical vectors are recycled to the same length as the vector being subset, i.e. x[NA] is recycled to x[c(NA, NA, NA, NA, NA)].
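The difference between the two kinds of missing value can be checked directly. This is a small sketch; `res_logical` and `res_double` are names chosen here for illustration.

```r
x <- 1:5

typeof(NA)        # logical
typeof(NA_real_)  # double

# A logical index is recycled to length(x), giving one NA per element
res_logical <- x[NA]

# A double index is treated as a single (missing) position
res_double <- x[NA_real_]

length(res_logical)
length(res_double)
```

Because the numeric index is not recycled, `x[NA_real_]` yields a single missing value rather than five.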
Q3: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?
A: upper.tri(x) returns a logical matrix, which contains TRUE values above the diagonal and FALSE values everywhere else. In upper.tri() the positions for TRUE and FALSE values are determined by comparing x’s row and column indices via .row(dim(x)) < .col(dim(x)).
x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    5
#> [2,]    2    4    6    8   10
#> [3,]    3    6    9   12   15
#> [4,]    4    8   12   16   20
#> [5,]    5   10   15   20   25
upper.tri(x)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,] FALSE  TRUE  TRUE  TRUE  TRUE
#> [2,] FALSE FALSE  TRUE  TRUE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE  TRUE
#> [4,] FALSE FALSE FALSE FALSE  TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE
When subsetting with logical matrices, all elements that correspond to TRUE will be selected. Matrices extend vectors with a dimension attribute, so the vector forms of subsetting can be used (including logical subsetting). We should take care that the dimensions of the subsetting matrix match the object of interest; otherwise, unintended selections due to vector recycling may occur. Please also note that this form of subsetting returns a vector instead of a matrix, as the subsetting alters the dimensions of the object.
x[upper.tri(x)]
#>  [1]  2  3  6  4  8 12  5 10 15 20
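The following sketch illustrates both points: logical matrix subsetting is ordinary logical vector subsetting, and a mismatched index matrix is silently recycled. The definition `x <- outer(1:5, 1:5)` is an assumption that reproduces the matrix printed above; `idx`, `small`, and `recycled` are illustrative names.

```r
# Assumed definition: a 5 x 5 multiplication table
x <- outer(1:5, 1:5)
idx <- upper.tri(x)

# The dim attribute of the index is ignored; elements are picked
# in column-major order, exactly as with a plain logical vector
same_as_vector <- identical(x[idx], x[as.vector(idx)])
same_as_which  <- identical(x[idx], x[which(idx)])

# A 2 x 2 index has only 4 elements and is recycled across all
# 25 elements of x without any warning
small <- upper.tri(matrix(0, 2, 2))
recycled <- x[small]
length(recycled)
```

The recycled result no longer corresponds to any triangle of `x`, which is why matching dimensions matter here.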
Q4: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
A: When subsetting a data frame with a single vector, it behaves the same way as subsetting a list of columns. So, mtcars[1:20] would return a data frame containing the first 20 columns of the dataset. However, as mtcars has only 11 columns, the index is out of bounds and an error is thrown. mtcars[1:20, ] is subset with two vectors, so 2d subsetting kicks in, and the first index refers to rows.
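A short sketch of the two behaviours, using `tryCatch()` to capture the error message instead of stopping; `err`, `first_cols`, and `first_rows` are illustrative names.

```r
ncol(mtcars)  # 11 columns, so column indices above 11 do not exist

# List-style subsetting: 1:20 refers to columns and fails
err <- tryCatch(
  mtcars[1:20],
  error = function(e) conditionMessage(e)
)
err

# Staying within bounds works and returns a data frame of columns
first_cols <- mtcars[1:11]

# Matrix-style subsetting: 1:20 refers to rows
first_rows <- mtcars[1:20, ]
dim(first_rows)
```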
Q5: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).
A: The diagonal elements of a matrix have identical row and column indices. We can use this property to build a two-column numeric matrix of (row, column) pairs and subset with it.
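diag2 <- function(x) {
  # Number of diagonal entries (also works for non-square matrices)
  n <- min(nrow(x), ncol(x))
  # Each row of idx is a (row, column) pair on the diagonal
  idx <- cbind(seq_len(n), seq_len(n))
  x[idx]
}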
# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30
diag(x)
#> [1] 1 7 13 19 25
diag2(x)
#> [1] 1 7 13 19 25
Q6: What does df[is.na(df)] <- 0 do? How does it work?
A: This expression replaces the NAs in df with 0. Here is.na(df) returns a logical matrix that encodes the position of the missing values in df. Subsetting and assignment are then combined to replace only the missing values.
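A minimal sketch of this on a small, made-up data frame (the values in `df` are invented for illustration):

```r
# Hypothetical example data with two missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Logical matrix with the same shape as df, TRUE where a value is missing
na_pos <- is.na(df)
na_pos

# Subassignment replaces only the entries flagged as TRUE
df[is.na(df)] <- 0
df
```

All non-missing entries are left untouched, since the logical matrix is FALSE at their positions.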
3.2 Selecting a single element
Q1: Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.
A: Base R already provides an abundance of possibilities:
# Select column first
mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4
# Select row first
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4
# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4
Q2: Given a linear model, e.g. mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)).
A: mod is of type list, which opens up several possibilities. We use $ or [[ to extract a single element:
mod <- lm(mpg ~ wt, data = mtcars)
mod$df.residual
#> [1] 30
mod[["df.residual"]]
#> [1] 30
The same also applies to summary(mod), so we could use, e.g.:
summary(mod)$r.squared
#> [1] 0.753
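To see which components are available for extraction, it can help to inspect the names of both objects first. The sketch below also uses the base R extractor `df.residual()` as an alternative to direct subsetting; `r2` and `smry` are names chosen here for illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Both the model and its summary are lists underneath
is.list(mod)
names(mod)

smry <- summary(mod)
names(smry)

# `[[` works the same way as `$` on the summary object
r2 <- smry[["r.squared"]]
r2

# The generic extractor returns the same value as mod$df.residual
df.residual(mod)
```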
(Tip: The {broom} package provides a very useful approach to work with models in a tidy way.)
3.3 Applications
Q1: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
A: This can be achieved by combining [ and sample():
# Permute columns
mtcars[sample(ncol(mtcars))]
# Permute columns and rows in one step
mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
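As a quick sanity check, a permutation should only reorder the data, not change it. The sketch below fixes the seed for reproducibility (the seed value is arbitrary) and restores the original order by name; `perm_cols`, `perm_both`, and `restored` are illustrative names.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

perm_cols <- mtcars[sample(ncol(mtcars))]
perm_both <- mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]

# Same column names, possibly in a different order
setequal(names(perm_cols), names(mtcars))

# Indexing by the original row and column names undoes the permutation
restored <- perm_both[rownames(mtcars), names(mtcars)]
identical(restored, mtcars)
```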
Q2: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e. with an initial row, a final row, and every row in between)?
A: Selecting m random rows from a data frame can be achieved through subsetting with sample().
Keeping successive rows together as a contiguous block only requires some care in choosing the start index, so that the block of m rows does not run past the last row.
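Both variants can be sketched as follows, assuming `m = 10` and `mtcars` as the example data; `random_rows`, `start`, `end`, and `block` are names introduced here for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
m <- 10

# m random rows, sampled without replacement
random_rows <- mtcars[sample(nrow(mtcars), m), ]

# Contiguous block: the latest possible start is nrow - m + 1,
# so that start + m - 1 never exceeds nrow
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1
block <- mtcars[start:end, , drop = FALSE]
```

The `drop = FALSE` argument is a defensive choice: it keeps the result a data frame even if the input had only a single column.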
Q3: How could you put the columns in a data frame in alphabetical order?