# 3 Subsetting

## 3.1 Selecting multiple elements

**Q1**: Fix each of the following common data frame subsetting errors:

```
mtcars[mtcars$cyl = 4, ]
# use `==` (instead of `=`)

mtcars[-1:4, ]
# use `-(1:4)` (instead of `-1:4`)

mtcars[mtcars$cyl <= 5]
# `,` is missing

mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
# or `%in% c(4, 6)` (instead of `== 4 | 6`)
```
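
For reference, the corrected calls might look as follows. This is a minimal sketch using the built-in `mtcars` data; the `nrow()` wrappers are an addition here, only meant to keep the output short.

```r
# `==` tests for equality
nrow(mtcars[mtcars$cyl == 4, ])
#> [1] 11

# Parentheses are needed because unary minus binds tighter than `:`
nrow(mtcars[-(1:4), ])
#> [1] 28

# The trailing comma turns this into row subsetting
nrow(mtcars[mtcars$cyl <= 5, ])
#> [1] 11

# `%in%` compares against several values at once
nrow(mtcars[mtcars$cyl %in% c(4, 6), ])
#> [1] 18
```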

**Q2**: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)

```
x <- 1:5
x[NA]
#> [1] NA NA NA NA NA
```

**A**: In contrast to `NA_real_`, `NA` has logical type, and logical vectors are recycled to the same length as the vector being subset, i.e. `x[NA]` is recycled to `x[c(NA, NA, NA, NA, NA)]`.
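
The difference in type and recycling can be checked directly. The sketch below only uses base R; `rep(NA, length(x))` is used here to spell out what the recycled index looks like.

```r
x <- 1:5

typeof(NA)
#> [1] "logical"
typeof(NA_real_)
#> [1] "double"

# A logical index is recycled to the length of `x`
identical(x[NA], x[rep(NA, length(x))])
#> [1] TRUE

# A numeric index is not recycled: one index, one (missing) result
x[NA_real_]
#> [1] NA
```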

**Q3**: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

**A**: `upper.tri(x)` returns a logical matrix, which contains `TRUE` values above the diagonal and `FALSE` values everywhere else. In `upper.tri()` the positions for `TRUE` and `FALSE` values are determined by comparing `x`'s row and column indices via `.row(dim(x)) < .col(dim(x))`.

```
x <- outer(1:5, 1:5, FUN = "*")
x
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    5
#> [2,]    2    4    6    8   10
#> [3,]    3    6    9   12   15
#> [4,]    4    8   12   16   20
#> [5,]    5   10   15   20   25
upper.tri(x)
#>       [,1]  [,2]  [,3]  [,4]  [,5]
#> [1,] FALSE  TRUE  TRUE  TRUE  TRUE
#> [2,] FALSE FALSE  TRUE  TRUE  TRUE
#> [3,] FALSE FALSE FALSE  TRUE  TRUE
#> [4,] FALSE FALSE FALSE FALSE  TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE
```

When subsetting with logical matrices, all elements that correspond to `TRUE` will be selected. Matrices extend vectors with a dimension attribute, so the vector forms of subsetting can be used (including logical subsetting), and no additional rules are needed. We should take care that the dimensions of the subsetting matrix match the object of interest; otherwise, unintended selections due to vector recycling may occur. Please also note that this form of subsetting returns a vector instead of a matrix, as the subsetting alters the dimensions of the object.

```
x[upper.tri(x)]
#>  [1]  2  3  6  4  8 12  5 10 15 20
```
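
To connect this to the ordinary vector rules, the following sketch (assuming `x` is the 5 x 5 multiplication table from the example) rebuilds the logical matrix with `row()` and `col()` and checks that the subsetting result has lost its dimensions.

```r
x <- outer(1:5, 1:5)

# Same comparison of row and column indices that upper.tri() performs
idx <- row(x) < col(x)
all(idx == upper.tri(x))
#> [1] TRUE

# Logical subsetting treats the matrix as a vector (column-major order),
# so the result is a plain vector without a dim attribute
res <- x[idx]
dim(res)
#> NULL
length(res)
#> [1] 10
```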

**Q4**: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?

**A**: When subsetting a data frame with a single vector, it behaves the same way as subsetting a list of columns. So, `mtcars[1:20]` would return a data frame containing the first 20 columns of the dataset. However, as `mtcars` has only 11 columns, the index is out of bounds and an error is thrown. `mtcars[1:20, ]` is subset with two vectors, so 2d subsetting kicks in, and the first index refers to rows.
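
Both behaviours can be observed side by side. In this sketch the error is captured with `tryCatch()` so that the script keeps running; the variable name `msg` is just an illustration.

```r
# One index: list-style subsetting of columns
dim(mtcars[1:11])
#> [1] 32 11

# Asking for 20 columns fails; capture the error message instead of stopping
msg <- tryCatch(mtcars[1:20], error = function(e) conditionMessage(e))
msg

# Two indices: the first one refers to rows
dim(mtcars[1:20, ])
#> [1] 20 11
```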

**Q5**: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).

**A**: The elements on the diagonal of a matrix have the same row and column indices. This characteristic can be used to create a two-column numeric matrix of (row, column) index pairs, which is then used for subsetting.

```
diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  idx <- cbind(seq_len(n), seq_len(n))

  x[idx]
}

# Let's check if it works
(x <- matrix(1:30, 5))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    6   11   16   21   26
#> [2,]    2    7   12   17   22   27
#> [3,]    3    8   13   18   23   28
#> [4,]    4    9   14   19   24   29
#> [5,]    5   10   15   20   25   30

diag(x)
#> [1]  1  7 13 19 25
diag2(x)
#> [1]  1  7 13 19 25
```
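
As a variation, the same idea could also be expressed with a logical matrix, since the diagonal elements are exactly those where `row(x) == col(x)`. `diag3()` is a hypothetical name chosen for this sketch, not a base R function.

```r
diag3 <- function(x) {
  # TRUE where row and column index coincide
  x[row(x) == col(x)]
}

x <- matrix(1:30, 5)
diag3(x)
#> [1]  1  7 13 19 25
all(diag3(x) == diag(x))
#> [1] TRUE
```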

**Q6**: What does `df[is.na(df)] <- 0` do? How does it work?

**A**: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
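
A small worked example may make this concrete. The data frame below is made up for illustration only.

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Logical matrix marking the missing values
na_pos <- is.na(df)
is.matrix(na_pos)
#> [1] TRUE

# Replace only the flagged positions
df[is.na(df)] <- 0
df
#>   a b
#> 1 1 0
#> 2 0 5
#> 3 3 6
```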

## 3.2 Selecting a single element

**Q1**: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.

**A**: Base R already provides an abundance of possibilities:

```
# Select column first
mtcars$cyl[[3]]
#> [1] 4
mtcars[ , "cyl"][[3]]
#> [1] 4
mtcars[["cyl"]][[3]]
#> [1] 4
with(mtcars, cyl[[3]])
#> [1] 4
# Select row first
mtcars[3, ]$cyl
#> [1] 4
mtcars[3, "cyl"]
#> [1] 4
mtcars[3, ][ , "cyl"]
#> [1] 4
mtcars[3, ][["cyl"]]
#> [1] 4
# Select simultaneously
mtcars[3, 2]
#> [1] 4
mtcars[[c(2, 3)]]
#> [1] 4
```

**Q2**: Given a linear model, e.g. `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).

**A**: `mod` is of type list, which opens up several possibilities. We use `$` or `[[` to extract a single element:

```
mod <- lm(mpg ~ wt, data = mtcars)
mod$df.residual
#> [1] 30
mod[["df.residual"]]
#> [1] 30
```

The same also applies to `summary(mod)`, so we could use e.g.:

```
summary(mod)$r.squared
#> [1] 0.753
```

(Tip: The `{broom}` package^{11} provides a very useful approach to work with models in a tidy way.)

## 3.3 Applications

**Q1**: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

**A**: This can be achieved by combining `[` and `sample()`:

```
# Permute columns
mtcars[sample(ncol(mtcars))]
# Permute columns and rows in one step
mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
```

**Q2**: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

**A**: Selecting `m` random rows from a data frame can be achieved through subsetting.
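
A minimal sketch of this, assuming `mtcars` as the data frame and `m = 10`; the call to `set.seed()` is only there to make the illustration reproducible.

```r
set.seed(1)  # only for reproducibility of this sketch
m <- 10

# Draw m distinct row indices and use them for row subsetting
rows <- sample(nrow(mtcars), m)
sampled <- mtcars[rows, ]

nrow(sampled)
#> [1] 10
```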

Keeping successive rows together as a contiguous block only requires some care to obtain the correct start and end index.
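
One way to handle the indices, sketched here with `mtcars` and `m = 10` as assumptions: the start row is drawn so that the block of `m` rows still fits into the data frame.

```r
m <- 10

# The latest possible start is nrow - m + 1, so the block never overruns
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1

# drop = FALSE keeps the result a data frame even in edge cases
block <- mtcars[start:end, , drop = FALSE]

nrow(block)
#> [1] 10
```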

**Q3**: How could you put the columns in a data frame in alphabetical order?
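
**A**: One possible approach is to combine `[` with `order()` or `sort()` applied to the column names. The following is a minimal sketch using `mtcars`; the names `by_order` and `by_sort` are only illustrative.

```r
# Reorder columns via the ordering permutation of their names
by_order <- mtcars[order(names(mtcars))]

# Or select the columns by their sorted names directly
by_sort <- mtcars[sort(names(mtcars))]

identical(by_order, by_sort)
#> [1] TRUE
```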