9 Functionals

9.1 Prerequisites

library(purrr)

9.2 My first functional: map()

1. Q: Use as_mapper() to explore how purrr generates anonymous functions for the integer, character, and list helpers. What helper allows you to extract attributes? Read the documentation to find out.

A: map() offers multiple ways (functions, formulas and extractor functions) to specify the function argument (.f). Initially, the various inputs have to be transformed into a valid function, which is then applied. The creation of this valid function is the job of as_mapper() and it is called every time map() is used.

Given character, numeric or list input, as_mapper() will create an extractor function. Characters select by name, numeric input selects by position, and a list allows a mix of these two approaches. This extractor interface can be very useful when working with nested data.

The extractor function is implemented as a call to purrr::pluck(), which accepts a list of accessors (accessors “access” some part of your data object).

as_mapper(c(1, 2))
#> function (x, ...)
#> pluck(x, list(1, 2), .default = NULL)
#> <environment: 0x2d7e130>
as_mapper(c("a", "b"))
#> function (x, ...)
#> pluck(x, list("a", "b"), .default = NULL)
#> <environment: 0x2fd5490>
as_mapper(list(1, "b"))
#> function (x, ...)
#> pluck(x, list(1, "b"), .default = NULL)
#> <environment: 0x29f52f0>

Besides mixing positions and names, it is also possible to pass along an accessor function. This is basically an anonymous function that retrieves information about some aspect of the input data. You are free to define your own accessor functions.

If you need to access certain attributes, the helper attr_getter() is already predefined and will create the appropriate accessor function for you.

# define custom accessor function
get_class <- function(x) attr(x, "class")
pluck(mtcars, get_class)
#> [1] "data.frame"

# use attr_getter() as a helper
pluck(mtcars, attr_getter("class"))
#> [1] "data.frame"
2. Q: map(1:3, ~ runif(2)) is a useful pattern for generating random numbers, but map(1:3, runif(2)) is not. Why not? Can you explain why it returns the result that it does?

A: The first pattern creates random numbers, because ~ runif(2) successfully uses the formula interface. Internally map() applies as_mapper() to this formula, which converts ~ runif(2) into an anonymous function. Afterwards runif(2) is applied three times (one time during each iteration), leading to three different pairs of random numbers.

In the second pattern runif(2) is supplied as an atomic vector. Consequently as_mapper() creates an extractor function based on the return values from runif(2) (via pluck()). This leads to three NULLs (pluck()’s .default return), because no values corresponding to the index can be found.

map(1:3, ~ runif(2))  # uses formula interface
#> [[1]]
#> [1] 0.5680037 0.4628050
#>
#> [[2]]
#> [1] 0.07704971 0.34011013
#>
#> [[3]]
#> [1] 0.4236263 0.9469419
map(1:3, runif(2))  # uses extractor interface
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
3. Q: Use the appropriate map() function to:

1. Compute the standard deviation of every column in a numeric data frame.

2. Compute the standard deviation of every numeric column in a mixed data frame. (Hint: you’ll need to do it in two steps.)

3. Compute the number of levels for every factor in a data frame.

A: To solve this exercise we take advantage of the type-stable variants of map(), which give us more concise output. We also use purrr::keep() to select the matching columns of the data frames beforehand. (keep() is introduced in the predicate functionals section of the “Functionals” chapter.)

map_dbl(mtcars, sd)
#>         mpg         cyl        disp          hp        drat          wt
#>   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574
#>        qsec          vs          am        gear        carb
#>   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000
map_dbl(keep(iris, is.numeric), sd)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>    0.8280661    0.4358663    1.7652982    0.7622377
map_int(keep(iris, is.factor), ~ length(levels(.x)))
#> Species
#>       3
4. Q: The following code simulates the performance of a t-test for non-normal data. Extract the p-value from each test, then visualise.

trials <- map(1:100, ~ t.test(rpois(10, 10), rpois(7, 10)))

A: pluck() allows us to elegantly extract the p-values. We can then pass a data frame directly to ggplot() for the visualisation. For randomly generated data, we can expect a uniform distribution of the p-values.

library(ggplot2)

map_dbl(trials, "p.value") %>%
tibble::tibble(p_value = .) %>%
ggplot(aes(x = p_value, fill = p_value < 0.05)) +
geom_dotplot(binwidth = .025) +
ggtitle("Distribution of p-values for random Poisson data.")

1. Q: The following code uses a map nested inside another map to apply a function to every element of a nested list. Why does it fail, and what do you need to do to make it work?

x <- list(
list(1, c(3, 9)),
list(c(3, 6), 7, c(4, 7, 6))
)

triple <- function(x) x * 3
map(x, map, .f = triple)
#> Error in .f(.x[[i]], ...): unused argument (map)

A: This function call fails, because triple() is specified as the .f argument and consequently belongs to the outer map(). The unnamed argument map is treated as an argument of triple(), which causes the error.

If we switch the naming of the argument, the nested transformations work as expected:

map(x, .f = map, triple)
#> [[1]]
#> [[1]][[1]]
#> [1] 3
#>
#> [[1]][[2]]
#> [1]  9 27
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1]  9 18
#>
#> [[2]][[2]]
#> [1] 21
#>
#> [[2]][[3]]
#> [1] 12 21 18
2. Q: Use map() to fit linear models to the mtcars using the formulas stored in this list:

formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)

A: The data (mtcars) is constant for all these models and we iterate over the formulas provided. Because a formula is the first argument of an lm() call, it doesn’t need to be specified explicitly.

map(formulas, lm, data = mtcars) %>%
map(coef)  # shortens output
#> [[1]]
#> (Intercept)        disp
#> 29.59985476 -0.04121512
#>
#> [[2]]
#> (Intercept)   I(1/disp)
#>    10.75202  1557.67393
#>
#> [[3]]
#> (Intercept)        disp          wt
#> 34.96055404 -0.01772474 -3.35082533
#>
#> [[4]]
#> (Intercept)   I(1/disp)          wt
#>   19.024206 1142.559975   -1.797649
3. Q: Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below, then extract the $$R^2$$ of the model fit (Hint: you can compute the $$R^2$$ with summary())

bootstrap <- function(df) {
df[sample(nrow(df), replace = TRUE), , drop = FALSE]
}

bootstraps <- map(1:10, ~ bootstrap(mtcars))

A: To accomplish this task, we take advantage of the “list in, list out” functionality of map(). This allows us to chain multiple transformations together. We start by fitting the models, then calculate the summaries and extract the $$R^2$$ values. For the last call we use map_dbl(), which provides convenient output.

bootstraps %>%
map(~ lm(mpg ~ disp, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
#>  [1] 0.7297584 0.7405662 0.7815003 0.6915903 0.7421542 0.7484293 0.6566547
#>  [8] 0.7246759 0.5515322 0.6910429

9.3 Map variants

1. Q: Explain the results of modify(mtcars, 1).

A: modify() is based on map(), and in this case the extractor interface will be used: it extracts the first element of each column in mtcars. modify() always returns the same structure as its input, so here the first row is recycled 32 times. (Internally modify() uses .x[] <- map(.x, .f, ...) for assignment.)
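A small, hypothetical data frame (made up for illustration) makes the recycling visible:

```r
library(purrr)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# modify() extracts the first element of each column and then
# recycles it back to the original number of rows
modify(df, 1)
#>   x y
#> 1 1 a
#> 2 1 a
#> 3 1 a
```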

2. Q: Rewrite the following code to use iwalk() instead of walk2(). What are the advantages and disadvantages?

cyls <- split(mtcars, mtcars$cyl)
paths <- file.path(temp, paste0("cyl-", names(cyls), ".csv"))
walk2(cyls, paths, write.csv)

A: With iwalk() it is possible to combine the name creation and the file saving in one function call, and no intermediate objects (which won’t be used any further) need to be created. On the downside, the resulting call becomes quite long and may be harder to understand at first glance.

temp_dir <- tempfile()
dir.create(temp_dir)

split(mtcars, mtcars$cyl) %>%
iwalk(~ write.csv(.x, file.path(temp_dir, paste0("cyl-", .y, ".csv"))))

list.files(temp_dir)
#> [1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"
3. Q: Explain how the following code transforms a data frame using functions stored in a list.

trans <- list(
disp = function(x) x * 0.0163871,
am = function(x) factor(x, labels = c("auto", "manual"))
)

vars <- names(trans)
mtcars[vars] <- map2(trans, mtcars[vars], function(f, var) f(var))

Compare and contrast the map2() approach to this map() approach:

mtcars[vars] <- map(vars, ~ trans[[.x]](mtcars[[.x]]))

A: In the map2() approach vars is first used to subset the .y argument. Then map2() iterates over the similarly named variables and transformations to build up and apply the transformations as anonymous functions. On the left-hand side, only the similarly named variables of mtcars are being replaced.

The map() variant does basically the same. However, it uses the vars object to directly iterate over similarly named elements of trans and mtcars.

Besides the iteration pattern, the approaches differ in the naming and the argument types supplied to .f. While .f in the map2() variant has appropriately named arguments, the creation of the anonymous function is a bit verbose. A more compact, if more technical, way to write it would be the formula interface, as in mtcars[vars] <- map2(trans, mtcars[vars], ~ .x(.y)), at the cost of purrr’s default .x/.y names. In the map() approach, the formula in .f together with the double brackets clearly indicates that functions are applied to specific elements of mtcars. Combined with the replacement form on the left-hand side, it is easy to see what the line is doing, and the explicit indexing via vars highlights the important matching of trans and mtcars element names.

4. Q: What does write.csv() return? i.e. what happens if you use it with map2() instead of walk2()?

A: write.csv() returns NULL. Since we iterate over a list of data frames and file names in the example above, a named list of NULLs would be returned.

cyls <- split(mtcars, mtcars$cyl)
paths <- file.path(temp_dir, paste0("cyl-", names(cyls), ".csv"))

map2(cyls, paths, write.csv)
#> $`4`
#> NULL
#>
#> $`6`
#> NULL
#>
#> $`8`
#> NULL

9.4 Predicate Functionals

1. Q: Why isn’t is.na() a predicate function? What base R function is closest to being a predicate version of is.na()?

A: is.na() is not a predicate function, because a predicate function may only return TRUE or FALSE, and this is not strictly the case for is.na() (e.g. is.na(NULL) returns logical(0)). Applied elementwise, anyNA() may be closest to a predicate version of is.na() in base R.
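To illustrate: is.na() does not always return a single TRUE or FALSE, while an elementwise anyNA() does:

```r
is.na(NULL)
#> logical(0)

# anyNA(), applied to each element, behaves like a predicate
sapply(list(1, NA, "a"), anyNA)
#> [1] FALSE  TRUE FALSE
```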

1. Q: simple_reduce() has a problem when x is length 0 or length 1. Describe the source of the problem and how you might go about fixing it.

simple_reduce <- function(x, f) {
out <- x[[1]]
for (i in seq(2, length(x))) {
out <- f(out, x[[i]])
}
out
}

A: The loop inside simple_reduce() always starts at index 2. Therefore, subsetting length-0 and length-1 vectors via [[ leads to a subscript out of bounds error. To avoid this, we allow simple_reduce() to return() before the for loop is started and include a default argument for zero-length vectors.

simple_reduce <- function(x, f, default) {
if(length(x) == 0L) return(default)
if(length(x) == 1L) return(x[[1L]])

out <- x[[1]]
for (i in seq(2, length(x))) {
out <- f(out, x[[i]])
}
out
}

Our new simple_reduce() now works as intended:

simple_reduce(integer(0), +)
#> Error in simple_reduce(integer(0), +): argument "default" is missing, with no default
simple_reduce(integer(0), +, default = 0L)
#> [1] 0
simple_reduce(1, +)
#> [1] 1
simple_reduce(1:3, +)
#> [1] 6
2. Q: Implement the span() function from Haskell: given a list x and a predicate function f, span(x, f) returns the location of the longest sequential run of elements where the predicate is true. (Hint: you might find rle() helpful.)

A: Our span_r() function returns the first index of the longest sequential run of elements where the predicate is true. If there is more than one longest run, more than one index is returned.

span_r <- function(l, pred){
# We test if l is a list
stopifnot(is.list(l))

# we preallocate a logical vector and save the result
# of the predicate function applied to each element of the list
test <- vector("logical", length(l))
for (i in seq_along(l)){
test[i] <- (pred(l[[i]]))
}
# we return NA, if the output of pred is always FALSE
if(!any(test)) return(NA_integer_)

# Otherwise we look at the length encoding of TRUE and FALSE values.
rle_test <- rle(test)
# Since it might happen that more than one maximal series of TRUEs appears,
# we have to implement some logic, which might be easier if we save the rle
# output in a data.frame
rle_test <- data.frame(lengths = rle_test[["lengths"]],
values = rle_test[["values"]],
cumsum = cumsum(rle_test[["lengths"]]))
rle_test[["first_index"]] <- rle_test[["cumsum"]] - rle_test[["lengths"]] + 1
# In the last line we calculated the first index in the original list for every encoding
# In the next line we calculate a column, which gives the maximum
# encoding length among all encodings with the value TRUE
rle_test[["max"]] <-  max(rle_test[rle_test[, "values"] == TRUE, ][,"lengths"])
# Now we just have to subset for maximum length among all TRUE values and return the
# according "first index":
rle_test[rle_test$lengths == rle_test$max & rle_test$values == TRUE, ]$first_index
}
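The rle() bookkeeping above can be checked step by step on a small logical vector (the vector is made up for illustration):

```r
test <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

r <- rle(test)
r$lengths  # 1 1 1 2 1
r$values   # FALSE TRUE FALSE TRUE FALSE

# first index of each run
first_index <- cumsum(r$lengths) - r$lengths + 1L
first_index  # 1 2 3 4 6

# start of the longest TRUE run
first_index[r$values][which.max(r$lengths[r$values])]
#> [1] 4
```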
3. Q: Implement arg_max(). It should take a function and a vector of inputs, and return the elements of the input where the function returns the highest value. For example, arg_max(-10:5, function(x) x ^ 2) should return -10. arg_max(-5:5, function(x) x ^ 2) should return c(-5, 5). Also implement the matching arg_min() function.

A: Both functions take a vector of inputs and a function as arguments. The function’s output is then used to subset the input accordingly.

arg_max <- function(x, f){
x[f(x) == max(f(x))]
}

arg_min <- function(x, f){
x[f(x) == min(f(x))]
}

arg_max(-10:5, function(x) x ^ 2)
#> [1] -10
arg_min(-10:5, function(x) x ^ 2)
#> [1] 0

Both functions are actually quite similar, so it would also have been possible to pass the aggregation function (max() or min()) as an argument.

arg_ <- function(g, x, f){
x[f(x) == g(f(x))]
}

arg_(max, -10:5, function(x) x ^ 2)
#> [1] -10
arg_(min, -10:5, function(x) x ^ 2)
#> [1] 0
4. Q: The function below scales a vector so it falls in the range [0, 1]. How would you apply it to every column of a data frame? How would you apply it to every numeric column in a data frame?

scale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}

A: To apply a function to every column of a data frame, we can use purrr::modify(), which conveniently also returns a data frame. To limit the application to numeric columns, the scoped variant modify_if() can be used.

modify_if(iris, is.numeric, scale01)

9.5 Base functionals

1. Q: How does apply() arrange the output? Read the documentation and perform some experiments.

A: apply() arranges its output columns (or list elements) according to the order of the margin. The rows are ordered by the other dimensions, starting with the “last” dimension of the input object. What this means should become clear by looking at the three and four dimensional cases of the following example:

# for two dimensional cases everything is sorted by the other dimension
arr2 <- array(1:9, dim = c(3, 3), dimnames = list(paste0("row", 1:3),
paste0("col", 1:3)))
arr2
apply(arr2, 1, head, 1) # Margin is row
apply(arr2, 1, head, 9) # sorts by col

apply(arr2, 2, head, 1) # Margin is col
apply(arr2, 2, head, 9) # sorts by row
# 3 dimensional
arr3 <- array(1:27, dim = c(3, 3, 3), dimnames = list(paste0("row", 1:3),
paste0("col", 1:3),
paste0("time", 1:3)))
arr3
apply(arr3, 1, head, 1) # Margin is row
apply(arr3, 1, head, 27) # sorts by time and col

apply(arr3, 2, head, 1) # Margin is col
apply(arr3, 2, head, 27) # sorts by time and row

apply(arr3, 3, head, 1) # Margin is time
apply(arr3, 3, head, 27) # sorts by col and row
# 4 dimensional
arr4 <- array(1:81, dim = c(3 ,3, 3, 3),
dimnames = list(paste0("row", 1:3),
paste0("col", 1:3),
paste0("time", 1:3),
paste0("var", 1:3)))
arr4

apply(arr4, 1, head, 1) # Margin is row
apply(arr4, 1, head, 81) # sorts by var, time, col

apply(arr4, 2, head, 1) # Margin is col
apply(arr4, 2, head, 81) # sorts by var, time, row

apply(arr4, 3, head, 1) # Margin is time
apply(arr4, 3, head, 81) # sorts by var, col, row

apply(arr4, 4, head, 1) # Margin is var
apply(arr4, 4, head, 81) # sorts by time, col, row
2. Q: What do eapply() and rapply() do? Does purrr have equivalents?

A: eapply() is a variant of lapply(), which iterates over the (named) elements of an environment. In purrr there is no equivalent for eapply() as purrr mainly provides functions that operate on vectors and functions, but not on environments.

rapply() applies a function to all elements of a list recursively. It makes it possible to limit the application of the function to specified classes (default classes = "ANY"). One may also specify what happens to elements of other classes: with how = "replace" they are left unchanged, while with how = "unlist" or how = "list" they are replaced by a default value (deflt, NULL by default). Again, purrr doesn’t provide an equivalent to this function.
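A small example of rapply()’s recursive behaviour (the list is made up for illustration):

```r
x <- list(1, list(2, "a"))

# double every numeric leaf; leave everything else unchanged
rapply(x, function(el) el * 2, classes = "numeric", how = "replace")
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 4
#>
#> [[2]][[2]]
#> [1] "a"
```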

3. Q: Challenge: read about the fixed point algorithm. Complete the exercises using R.

A: A number $$x$$ is called a fixed point of a function $$f$$, if it satisfies the equation $$f(x) = x$$. For some functions we may find a fixed point by beginning with a starting value and applying $$f$$ repeatedly. Here find_fixed_point() acts as a functional, because it takes a function as an argument.

find_fixed_point <- function(f, x_start = 1, n_max = 10000, tol = 0.0001) {
# Initialize
n <- 1
x <- x_start

while (n < n_max) {
# Compute function and test for fixed point quality
y <- f(x)
is_fixed_point <- isTRUE(all.equal(x, y, tolerance = tol))

if(is_fixed_point){
# Success case
message("Fixed point was found, after ", n, " iterations.")
return(x)
} else {
# Recursive case
x <- y
n <- n + 1
}
}
if(!is_fixed_point){
# Non-converging case
message("No fixed point found.")
}
}
# Functions with fixed points
find_fixed_point(sin, x_start = 1)
#> Fixed point was found, after 4994 iterations.
#> [1] 0.02449357
find_fixed_point(cos, x_start = 1)
#> Fixed point was found, after 24 iterations.
#> [1] 0.7390548

# Functions without fixed points
add_one <- function(x) x + 1
find_fixed_point(add_one, x_start = 1)
#> No fixed point found.

9.7 My first functional: lapply()

1. Q: Why are the following two invocations of lapply() equivalent?

trims <- c(0, 0.1, 0.2, 0.5)
x <- rcauchy(100)

lapply(trims, function(trim) mean(x, trim = trim))
lapply(trims, mean, x = x)

A: In the first statement each element of trims is explicitly supplied to mean()’s second argument. In the latter statement this happens via positional matching, since mean()’s first argument is supplied via name in lapply()’s third argument (...).

2. Q: The function below scales a vector so it falls in the range [0, 1]. How would you apply it to every column of a data frame? How would you apply it to every numeric column in a data frame?

scale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}

A: Since this function needs numeric input, one can check this via an if clause. If one also wants to return non-numeric input columns, these can be supplied to the else argument of the if() “function”:

data.frame(lapply(iris, function(x) if (is.numeric(x)) scale01(x) else x))
3. Q: Use both for loops and lapply() to fit linear models to the mtcars using the formulas stored in this list:

formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)

A: Like in the first exercise, we can create two lapply() versions:

# lapply (2 versions)
la1 <- lapply(formulas, lm, data = mtcars)
la2 <- lapply(formulas, function(x) lm(formula = x, data = mtcars))

# for loop
lf1 <- vector("list", length(formulas))
for (i in seq_along(formulas)){
lf1[[i]] <- lm(formulas[[i]], data = mtcars)
}

Note that all versions return the same content, but they won’t be identical, since the values of the “call” element will differ between each version.
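This can be checked with a single formula (a condensed check; the names f, la and lf are local to this snippet):

```r
f <- list(mpg ~ disp)

la <- lapply(f, lm, data = mtcars)[[1]]
lf <- lm(f[[1]], data = mtcars)

identical(coef(la), coef(lf))  # TRUE: same fitted coefficients
identical(la$call, lf$call)    # FALSE: the recorded calls differ
```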

4. Q: Fit the model mpg ~ disp to each of the bootstrap replicates of mtcars in the list below by using a for loop and lapply(). Can you do it without an anonymous function?

bootstraps <- lapply(1:10, function(i) {
rows <- sample(1:nrow(mtcars), rep = TRUE)
mtcars[rows, ]
})

A:

# lapply without anonymous function
la <- lapply(bootstraps, lm, formula = mpg ~ disp)

# for loop
lf <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)){
lf[[i]] <- lm(mpg ~ disp, data = bootstraps[[i]])
}
5. Q: For each model in the previous two exercises, extract $$R^2$$ using the function below.

rsq <- function(mod) summary(mod)$r.squared

A: For the models in exercise 3:

sapply(la1, rsq)
#> [1] 0.7183433 0.8596865 0.7809306 0.8838038
sapply(la2, rsq)
#> [1] 0.7183433 0.8596865 0.7809306 0.8838038
sapply(lf1, rsq)
#> [1] 0.7183433 0.8596865 0.7809306 0.8838038

And the models in exercise 4:

sapply(la, rsq)
#>  [1] 0.6072816 0.7024504 0.7025947 0.7343113 0.6970626 0.7281322 0.7215245
#>  [8] 0.7281343 0.8069856 0.7422672
sapply(lf, rsq)
#>  [1] 0.6072816 0.7024504 0.7025947 0.7343113 0.6970626 0.7281322 0.7215245
#>  [8] 0.7281343 0.8069856 0.7422672

9.8 For loop functionals: friends of lapply()

1. Q: Use vapply() to:

1. Compute the standard deviation of every column in a numeric data frame.

2. Compute the standard deviation of every numeric column in a mixed data frame. (Hint: you’ll need to use vapply() twice.)

A: As a numeric data frame we choose cars:

vapply(cars, sd, numeric(1))

And as a mixed data frame we choose iris:

vapply(iris[vapply(iris, is.numeric, logical(1))], sd, numeric(1))

2. Q: Why is using sapply() to get the class() of each element in a data frame dangerous?

A: Columns of data frames might have more than one class, so the class of sapply()’s output may differ from time to time (silently). If …

• all columns have one class: sapply() returns a character vector
• one column has more classes than the others: sapply() returns a list
• all columns have the same number of classes, which is more than one: sapply() returns a matrix

For example:

a <- letters[1:3]
class(a) <- c("class1", "class2")
df <- data.frame(a = character(3))
df$a <- a
df$b <- a

class(sapply(df, class))
#> [1] "matrix"

Note that this case often appears while working with the POSIXt types, POSIXct and POSIXlt.

3. Q: The following code simulates the performance of a t-test for non-normal data. Use sapply() and an anonymous function to extract the p-value from every trial.

trials <- replicate(
100,
t.test(rpois(10, 10), rpois(7, 10)),
simplify = FALSE
)

Extra challenge: get rid of the anonymous function by using [[ directly.

A:

# anonymous function:
sapply(trials, function(x) x[["p.value"]])

# without anonymous function:
sapply(trials, "[[", "p.value")

4. Q: What does replicate() do? What sort of for loop does it eliminate? Why do its arguments differ from lapply() and friends?

A: As stated in ?replicate:

replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).

We can see this clearly in the source code:

#> function (n, expr, simplify = "array")
#> sapply(integer(n), eval.parent(substitute(function(...) expr)),
#> simplify = simplify)
#> <bytecode: 0x486dbf0>
#> <environment: namespace:base>

Like sapply(), replicate() eliminates a for loop. As explained for Map() in the textbook, every replicate() could also have been written via lapply(). But using replicate() is more concise and more clearly indicates what you’re trying to do.

5. Q: Implement a version of lapply() that supplies FUN with both the name and the value of each component.

A:

lapply_nms <- function(X, FUN, ...){
Map(FUN, X, names(X), ...)
}

lapply_nms(iris, function(x, y) c(class(x), y))
#> $Sepal.Length
#> [1] "numeric"      "Sepal.Length"
#>
#> $Sepal.Width #> [1] "numeric" "Sepal.Width" #> #>$Petal.Length
#> [1] "numeric"      "Petal.Length"
#>
#> $Petal.Width #> [1] "numeric" "Petal.Width" #> #>$Species
#> [1] "factor"  "Species"
6. Q: Implement a combination of Map() and vapply() to create an lapply() variant that iterates in parallel over all of its inputs and stores its outputs in a vector (or a matrix). What arguments should the function take?

A: As we understand this exercise, it is about working with a list of lists, like in the following example:

testlist <- list(iris, mtcars, cars)
lapply(testlist, function(x) vapply(x, mean, numeric(1)))
#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA

#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA
#> [[1]]
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#>     5.843333     3.057333     3.758000     1.199333           NA
#>
#> [[2]]
#>        mpg        cyl       disp         hp       drat         wt
#>  20.090625   6.187500   3.780862 146.687500   3.596563   3.217250
#>       qsec         vs         am       gear       carb
#>  17.848750   0.437500         NA   3.687500   2.812500
#>
#> [[3]]
#> speed  dist
#> 15.40 42.98

So we can get the same result with a more specialized function:

lmapply <- function(X, FUN, FUN.VALUE, simplify = FALSE){
out <- Map(function(x) vapply(x, FUN, FUN.VALUE), X)
if(simplify == TRUE){return(simplify2array(out))}
out
}

lmapply(testlist, mean, numeric(1))
#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA

#> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
#> returning NA
#> [[1]]
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#>     5.843333     3.057333     3.758000     1.199333           NA
#>
#> [[2]]
#>        mpg        cyl       disp         hp       drat         wt
#>  20.090625   6.187500   3.780862 146.687500   3.596563   3.217250
#>       qsec         vs         am       gear       carb
#>  17.848750   0.437500         NA   3.687500   2.812500
#>
#> [[3]]
#> speed  dist
#> 15.40 42.98
7. Q: Implement mcsapply(), a multi-core version of sapply(). Can you implement mcvapply(), a parallel version of vapply()? Why or why not?
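A: One possible sketch combines parallel::mclapply() with the same simplification step that sapply() uses (simplify2array()). Note that mclapply() relies on forking, which is unavailable on Windows (where it requires mc.cores = 1):

```r
library(parallel)

# Sketch: parallel map, then the same simplification step sapply() uses.
# mc.cores = 1L keeps the example portable; raise it on mac/Linux.
mcsapply <- function(X, FUN, ..., mc.cores = 1L) {
  simplify2array(mclapply(X, FUN, ..., mc.cores = mc.cores))
}

mcsapply(1:3, function(x) x^2)
#> [1] 1 4 9
```

Implementing mcvapply() is not as straightforward: vapply() checks the type and length of each result while it iterates, whereas a parallel version only sees the results once all workers have returned, so such checks could only happen after the fact.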

9.9 Manipulating matrices and data frames

1. Q: How does apply() arrange the output? Read the documentation and perform some experiments.

A:

apply() arranges its output columns (or list elements) according to the order of the margin. The rows are ordered by the other dimensions, starting with the “last” dimension of the input object. What this means should become clear by looking at the three and four dimensional cases of the following example:

# for two dimensional cases everything is sorted by the other dimension
arr2 <- array(1:9, dim = c(3, 3), dimnames = list(paste0("row", 1:3),
paste0("col", 1:3)))
arr2
apply(arr2, 1, head, 1) # Margin is row
apply(arr2, 1, head, 9) # sorts by col

apply(arr2, 2, head, 1) # Margin is col
apply(arr2, 2, head, 9) # sorts by row

# 3 dimensional
arr3 <- array(1:27, dim = c(3,3,3), dimnames = list(paste0("row", 1:3),
paste0("col", 1:3),
paste0("time", 1:3)))
arr3
apply(arr3, 1, head, 1) # Margin is row
apply(arr3, 1, head, 27) # sorts by time and col

apply(arr3, 2, head, 1) # Margin is col
apply(arr3, 2, head, 27) # sorts by time and row

apply(arr3, 3, head, 1) # Margin is time
apply(arr3, 3, head, 27) # sorts by col and row

# 4 dimensional
arr4 <- array(1:81, dim = c(3,3,3,3), dimnames = list(paste0("row", 1:3),
paste0("col", 1:3),
paste0("time", 1:3),
paste0("var", 1:3)))
arr4

apply(arr4, 1, head, 1) # Margin is row
apply(arr4, 1, head, 81) # sorts by var, time, col

apply(arr4, 2, head, 1) # Margin is col
apply(arr4, 2, head, 81) # sorts by var, time, row

apply(arr4, 3, head, 1) # Margin is time
apply(arr4, 3, head, 81) # sorts by var, col, row

apply(arr4, 4, head, 1) # Margin is var
apply(arr4, 4, head, 81) # sorts by time, col, row
2. Q: There’s no equivalent to split() + vapply(). Should there be? When would it be useful? Implement one yourself.

A: We can modify the tapply2() approach from the book, where split() and sapply() were combined:

v_tapply <- function(x, group, f, FUN.VALUE, ..., USE.NAMES = TRUE) {
pieces <- split(x, group)
vapply(pieces, f, FUN.VALUE, ..., USE.NAMES = USE.NAMES)
}

tapply() has a SIMPLIFY argument; when you set it to FALSE, tapply() will always return a list. It is easy to create cases where the length and the types/classes of the list elements vary depending on the input. The vapply() version could be useful if you want to control the structure of the output (e.g. to raise an error according to the logic of a specific use case) or if you need type-stable output to build other functions on top of it.
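A quick check against base R (the definition is repeated, with USE.NAMES forwarded, so the snippet runs standalone):

```r
v_tapply <- function(x, group, f, FUN.VALUE, ..., USE.NAMES = TRUE) {
  pieces <- split(x, group)
  vapply(pieces, f, FUN.VALUE, ..., USE.NAMES = USE.NAMES)
}

# mean mpg per number of cylinders, as a named double vector
v_tapply(mtcars$mpg, mtcars$cyl, mean, numeric(1))
```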

3. Q: Implement a pure R version of split(). (Hint: use unique() and subsetting.) Can you do it without a for loop?

A:

split2 <- function(x, f, drop = FALSE, ...){
# There are three relevant cases for f: f is a character, f is a factor and all
# levels occur, or f is a factor and some levels don't occur.

# First we check if f is a factor
fact <- is.factor(f)

# If drop is set to TRUE, we drop the non-occurring levels.
# (If f is a character, this has no effect.)
if(drop){f <- f[, drop = TRUE]}

# now we want all unique elements/levels of f
levs <- if (fact) {unique(levels(f))} else {as.character(unique(f))}

# we use these levels to subset x and supply names for the resulting output.
setNames(lapply(levs, function(lv) x[f == lv, , drop = FALSE]), levs)
}
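A quick sanity check on a data frame (split2() is repeated in condensed form so the snippet runs standalone; as written it subsets rows via x[f == lv, , drop = FALSE], so x should be a data frame or matrix):

```r
split2 <- function(x, f, drop = FALSE, ...){
  fact <- is.factor(f)
  if (drop) { f <- f[, drop = TRUE] }
  levs <- if (fact) unique(levels(f)) else as.character(unique(f))
  setNames(lapply(levs, function(lv) x[f == lv, , drop = FALSE]), levs)
}

out <- split2(mtcars, factor(mtcars$cyl))
names(out)          # "4" "6" "8"
sapply(out, nrow)   # 11  7 14
```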
4. Q: What other types of input and output are missing? Brainstorm before you look up some answers in the plyr paper.

A: From the suggested plyr paper, we can extract a lot of possible combinations and list them up on a table. Sean C. Anderson already has done this based on a presentation from Hadley Wickham and provided the following result here.

object type          array      data frame   list        nothing
array                apply      .            .           .
data frame           .          aggregate    by          .
list                 sapply     .            lapply      .
n replicates         replicate  .            replicate   .
function arguments   mapply     .            mapply      .

Note the column nothing, which is specifically for use cases where side effects like plotting or writing data are intended.

9.10 Manipulating lists

1. Q: Why isn’t is.na() a predicate function? What base R function is closest to being a predicate version of is.na()?

A: Because a predicate function always returns TRUE or FALSE. is.na(NULL) returns logical(0), which excludes it from being a predicate function. The closest in base that we are aware of is anyNA(), if one applies it elementwise.

2. Q: Use Filter() and vapply() to create a function that applies a summary statistic to every numeric column in a data frame.

A:

vapply_num <- function(X, FUN, FUN.VALUE){
vapply(Filter(is.numeric, X), FUN, FUN.VALUE)
}
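For example, applied to iris (the definition is repeated so the call runs standalone):

```r
vapply_num <- function(X, FUN, FUN.VALUE){
  vapply(Filter(is.numeric, X), FUN, FUN.VALUE)
}

vapply_num(iris, mean, numeric(1))
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>     5.843333     3.057333     3.758000     1.199333
```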
3. Q: What’s the relationship between which() and Position()? What’s the relationship between where() and Filter()?

A: which() returns all indices of true entries from a logical vector. Position() returns just the first (default) or the last integer index of all true entries that occur by applying a predicate function on a vector. So the default relation is Position(f, x) <=> min(which(f(x))).

where(), defined in the book as:

where <- function(f, x) {
vapply(x, f, logical(1))
} 

is useful to return a logical vector from a condition asked on elements of a list or a data frame. Filter(f, x) returns all elements of a list or a data frame, where the supplied predicate function returns TRUE. So the relation is Filter(f, x) <=> x[where(f, x)].
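Both relations can be checked directly (the small data frame is made up for illustration; with R >= 4.0 the character column stays character by default):

```r
where <- function(f, x) {
  vapply(x, f, logical(1))
}

df <- data.frame(a = 1:3, b = c("x", "y", "z"), c = 4:6)

# Position() vs. which()
Position(is.character, df)           # 2
min(which(where(is.character, df)))  # 2

# Filter() vs. where()
identical(Filter(is.numeric, df), df[where(is.numeric, df)])
#> [1] TRUE
```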

4. Q: Implement Any(), a function that takes a list and a predicate function, and returns TRUE if the predicate function returns TRUE for any of the inputs. Implement All() similarly.

A: Any():

Any <- function(l, pred){
  stopifnot(is.list(l))

  for (i in seq_along(l)){
    if (pred(l[[i]])) return(TRUE)
  }

  return(FALSE)
}

All():

All <- function(l, pred){
  stopifnot(is.list(l))

  for (i in seq_along(l)){
    if (!pred(l[[i]])) return(FALSE)
  }

  return(TRUE)
}
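Equivalently, both can be expressed through existing functionals; a sketch (the names Any2()/All2() are our own):

```r
Any2 <- function(l, pred) any(vapply(l, pred, logical(1)))
All2 <- function(l, pred) all(vapply(l, pred, logical(1)))

Any2(list(1, NA, 3), is.na)
#> [1] TRUE
All2(list(1, NA, 3), is.na)
#> [1] FALSE
```

Note that these agree with the loop versions on empty lists too, because any(logical(0)) is FALSE and all(logical(0)) is TRUE.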
5. Q: Implement the span() function from Haskell: given a list x and a predicate function f, span returns the location of the longest sequential run of elements where the predicate is true. (Hint: you might find rle() helpful.)

A: Our span_r() function returns the first index of the longest sequential run of elements for which the predicate is true. If there is more than one longest run, the first index of each of them is returned.

span_r <- function(l, pred){
  # Test if l is a list
  stopifnot(is.list(l))

  # Preallocate a logical vector and save the result of the
  # predicate function applied to each element of the list
  test <- vector("logical", length(l))
  for (i in seq_along(l)){
    test[i] <- pred(l[[i]])
  }
  # Return NA if the output of pred is always FALSE
  if (!any(test)) return(NA_integer_)

  # Otherwise look at the run length encoding of TRUE and FALSE values.
  rle_test <- rle(test)
  # Since more than one maximal series of TRUEs may appear, we need some
  # logic, which is easier to implement if we save the rle output in a
  # data.frame
  rle_test <- data.frame(lengths = rle_test[["lengths"]],
                         values = rle_test[["values"]],
                         cumsum = cumsum(rle_test[["lengths"]]))
  # Calculate the first index in the original list for every run
  rle_test[["first_index"]] <- rle_test[["cumsum"]] - rle_test[["lengths"]] + 1
  # Calculate the maximum run length among all runs with the value TRUE
  rle_test[["max"]] <- max(rle_test[rle_test[, "values"] == TRUE, ][, "lengths"])
  # Subset to the longest runs among all TRUE values and return the
  # corresponding first indices:
  rle_test[rle_test$lengths == rle_test$max & rle_test$values == TRUE, ]$first_index
}
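The same idea can be written more compactly; a sketch (the name span_r2() is our own; this variant returns only the start of the first longest run):

```r
span_r2 <- function(l, pred) {
  test <- vapply(l, pred, logical(1))
  if (!any(test)) return(NA_integer_)
  r <- rle(test)
  starts <- cumsum(r$lengths) - r$lengths + 1   # first index of every run
  true_runs <- which(r$values)                  # runs where the predicate holds
  starts[true_runs[which.max(r$lengths[true_runs])]]
}

span_r2(list(1, NA, NA, NA, 2, NA), is.na)
#> [1] 2
```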

9.11 List of functions

1. Q: Implement a summary function that works like base::summary(), but uses a list of functions. Modify the function so it returns a closure, making it possible to use it as a function factory.
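A: One possible sketch (the names summarise_vec() and make_summary() are our own):

```r
summary_funs <- list(min = min, median = median, mean = mean, max = max)

# plain version: apply a list of functions to a vector
summarise_vec <- function(x, funs = summary_funs) {
  vapply(funs, function(f) f(x, na.rm = TRUE), numeric(1))
}

# factory version: fix the list of functions and return a closure
make_summary <- function(funs) {
  function(x) vapply(funs, function(f) f(x, na.rm = TRUE), numeric(1))
}

my_summary <- make_summary(summary_funs)
my_summary(c(1:10, NA))
```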

2. Q: Which of the following commands is equivalent to with(x, f(z))?

1. x$f(x$z).
2. f(x$z).
3. x$f(z).
4. f(z).
5. It depends.

9.12 Mathematical functionals

1. Q: Implement arg_max(). It should take a function and a vector of inputs, and return the elements of the input where the function returns the highest value. For example, arg_max(-10:5, function(x) x ^ 2) should return -10. arg_max(-5:5, function(x) x ^ 2) should return c(-5, 5). Also implement the matching arg_min() function.

A: arg_max():

arg_max <- function(x, f){
  x[f(x) == max(f(x))]
}

arg_min():

arg_min <- function(x, f){
  x[f(x) == min(f(x))]
}
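Checking these against the examples from the question (the definitions are repeated so the snippet runs on its own):

```r
arg_max <- function(x, f){
  x[f(x) == max(f(x))]
}
arg_min <- function(x, f){
  x[f(x) == min(f(x))]
}

arg_max(-10:5, function(x) x ^ 2)
#> [1] -10
arg_max(-5:5, function(x) x ^ 2)
#> [1] -5  5
arg_min(-5:5, function(x) x ^ 2)
#> [1] 0
```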
2. Q: Challenge: read about the fixed point algorithm. Complete the exercises using R.

9.13 A family of functions

1. Q: Implement smaller and larger functions that, given two inputs, return either the smaller or the larger value. Implement na.rm = TRUE: what should the identity be? (Hint: smaller(x, smaller(NA, NA, na.rm = TRUE), na.rm = TRUE) must be x, so smaller(NA, NA, na.rm = TRUE) must be bigger than any other value of x.) Use smaller and larger to implement equivalents of min(), max(), pmin(), pmax(), and new functions row_min() and row_max().

A: We can do almost everything as shown in the case study in the textbook. First we define the functions smaller_() and larger_(). We use the underscore suffix because we will build the non-suffixed versions, which include the na.rm parameter, on top of them. In contrast to the add() example from the book, we change two things at this step: we don't include error checking, since this is done later at the top level, and we return NA_integer_ if any of the arguments is NA. (The latter matters when na.rm is set to FALSE; it wasn't needed in the add() example, since + already returns NA in this case.)

smaller_ <- function(x, y){
  if (anyNA(c(x, y))) return(NA_integer_)
  out <- x
  if (y < x) out <- y
  out
}

larger_ <- function(x, y){
  if (anyNA(c(x, y))) return(NA_integer_)
  out <- x
  if (y > x) out <- y
  out
}

We can take rm_na() from the book:

rm_na <- function(x, y, identity) {
  if (is.na(x) && is.na(y)) {
    identity
  } else if (is.na(x)) {
    y
  } else {
    x
  }
}

To find the identity value, we can apply the same argument as in the textbook. Our functions are also associative, so the following equation should hold (with na.rm = TRUE implied):

smaller(smaller(3, NA), NA) = smaller(3, smaller(NA, NA)) = 3

So the identity has to be greater than or equal to 3. Generalising from 3 to any real number, the identity has to be at least as big as any number, which leads us to infinity. Hence the identity has to be Inf for smaller() (and -Inf for larger()), which we implement next:

smaller <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == 1, length(y) == 1,
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (na.rm && (is.na(x) || is.na(y))) rm_na(x, y, Inf) else smaller_(x, y)
}

larger <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == 1, length(y) == 1,
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (na.rm && (is.na(x) || is.na(y))) rm_na(x, y, -Inf) else larger_(x, y)
}

Like min() and max(), which can act on vectors, we can easily implement this for our new functions by using Reduce(). As shown in the book, we also have to set the init parameter to the identity value.

r_smaller <- function(xs, na.rm = TRUE) {
  Reduce(function(x, y) smaller(x, y, na.rm = na.rm), xs, init = Inf)
}
# some tests
r_smaller(c(1:3, 4:(-1)))
#> [1] -1
r_smaller(NA, na.rm = TRUE)
#> [1] Inf
r_smaller(numeric())
#> [1] Inf

r_larger <- function(xs, na.rm = TRUE) {
  Reduce(function(x, y) larger(x, y, na.rm = na.rm), xs, init = -Inf)
}
# some tests
r_larger(1:3)
#> [1] 3
r_larger(NA, na.rm = TRUE)
#> [1] -Inf
r_larger(numeric())
#> [1] -Inf

We can also create vectorised versions as shown in the book. To avoid being too verbose, we will just show the smaller() case.

v_smaller1 <- function(x, y, na.rm = FALSE){
  stopifnot(length(x) == length(y),
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (length(x) == 0) return(numeric())
  simplify2array(
    Map(function(x, y) smaller(x, y, na.rm = na.rm), x, y)
  )
}

v_smaller2 <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == length(y),
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  vapply(seq_along(x), function(i) smaller(x[i], y[i], na.rm = na.rm),
         numeric(1))
}

# Both versions give the same results
v_smaller1(1:10, c(2,1,4,3,6,5,8,7,10,9))
#>  [1] 1 1 3 3 5 5 7 7 9 9
v_smaller2(1:10, c(2,1,4,3,6,5,8,7,10,9))
#>  [1] 1 1 3 3 5 5 7 7 9 9

v_smaller1(numeric(), numeric())
#> numeric(0)
v_smaller2(numeric(), numeric())
#> numeric(0)

v_smaller1(c(1, NA), c(1, NA), na.rm = FALSE)
#> [1]  1 NA
v_smaller2(c(1, NA), c(1, NA), na.rm = FALSE)
#> [1]  1 NA

v_smaller1(NA,NA)
#> [1] NA
v_smaller2(NA,NA)
#> [1] NA

Of course, we can also copy and paste the rest from the textbook to solve the last part of the exercise:

row_min <- function(x, na.rm = FALSE) {
  apply(x, 1, r_smaller, na.rm = na.rm)
}
col_min <- function(x, na.rm = FALSE) {
  apply(x, 2, r_smaller, na.rm = na.rm)
}
arr_min <- function(x, dim, na.rm = FALSE) {
  apply(x, dim, r_smaller, na.rm = na.rm)
}
2. Q: Create a table that has and, or, add, multiply, smaller, and larger in the columns and binary operator, reducing variant, vectorised variant, and array variants in the rows.

1. Fill in the cells with the names of base R functions that perform each of the roles.

2. Compare the names and arguments of the existing R functions. How consistent are they? How could you improve them?

3. Complete the matrix by implementing any missing functions.

A: In the following table we list the requested base R functions that we are aware of:

             and   or    add   multiply   smaller   larger
binary       &&    ||
reducing     all   any   sum   prod       min       max
vectorised   &     |     +     *          pmin      pmax
array

Notice that we were relatively strict about the binary row. Since the vectorised and reducing versions are more general than the binary versions, we could have listed them twice. However, this doesn't seem to be the intention of this exercise.

The last part of this exercise can be solved, for the binary row, by copying and pasting from the book and the previous exercise, and, for the array row, by creating combinations of apply() and the reducing versions. We think the array functions just need a dimension and an na.rm argument. We don't know how we would name them, but something like sum_array(1, na.rm = TRUE) could be okay.

The second part of the exercise is hard to answer completely. But in our opinion there are two important points: the behaviour for special inputs like NA, NaN, NULL and zero-length atomics should be consistent, and all versions should have an na.rm argument for which the functions also behave consistently. In the following table we show the output of f(x, 1), where f is the function in the first column and x is the special input in the header (the named functions also have an na.rm argument, which is FALSE by default). The order of the arguments matters because of lazy evaluation.

       NA     NaN    NULL         logical(0)   integer(0)
&&     NA     NA     error        NA           NA
all    NA     NA     TRUE         TRUE         TRUE
&      NA     NA     error        logical(0)   logical(0)
||     TRUE   TRUE   error        TRUE         TRUE
any    TRUE   TRUE   TRUE         TRUE         TRUE
|      TRUE   TRUE   error        logical(0)   logical(0)
sum    NA     NaN    1            1            1
+      NA     NaN    numeric(0)   numeric(0)   numeric(0)
prod   NA     NaN    1            1            1
*      NA     NaN    numeric(0)   numeric(0)   numeric(0)
min    NA     NaN    1            1            1
pmin   NA     NaN    numeric(0)   numeric(0)   numeric(0)
max    NA     NaN    1            1            1
pmax   NA     NaN    numeric(0)   numeric(0)   numeric(0)

We can see that the vectorised and reducing numerical functions are all consistent. However, the logical functions are not: the first three return NA for NA and NaN, while the fourth to sixth all return TRUE. It would be more consistent if the first three returned FALSE, or if all of them returned NA and offered an extra na.rm argument. It seems relatively hard to find an easy rule for all cases, and especially the different behaviour for NULL is confusing. Another good way of organising the functions would be to differentiate between "numerical" and "logical" operators first, and then between binary, reducing and vectorised versions, like below (we left out the last column, which is redundant because of coercion):

f(x, 1)   NA     NaN    NULL         logical(0)
&&        NA     NA     error        NA
||        TRUE   TRUE   error        TRUE
all       NA     NA     TRUE         TRUE
any       TRUE   TRUE   TRUE         TRUE
&         NA     NA     error        logical(0)
|         TRUE   TRUE   error        logical(0)
sum       NA     NaN    1            1
prod      NA     NaN    1            1
min       NA     NaN    1            1
max       NA     NaN    1            1
+         NA     NaN    numeric(0)   numeric(0)
*         NA     NaN    numeric(0)   numeric(0)
pmin      NA     NaN    numeric(0)   numeric(0)
pmax      NA     NaN    numeric(0)   numeric(0)

The other point concerns the naming conventions. We think they are clear, but it could be useful to provide the missing binary operators and name them, for example, ++, **, <>, >< to be consistent.

3. Q: How does paste() fit into this structure? What is the scalar binary function that underlies paste()? What are the sep and collapse arguments to paste() equivalent to? Are there any paste variants that don’t have existing R implementations?

A: paste() behaves like a mix of both. If you supply only length-one arguments, it behaves like a reducing function:

paste("a", "b", sep = "")
#> [1] "ab"
paste("a", "b", "", sep = "")
#> [1] "ab"

If you supply at least one element with length greater than one, it behaves like a vectorised function:

paste(1:3)
#> [1] "1" "2" "3"
paste(1:3, 1:2)
#> [1] "1 1" "2 2" "3 1"
paste(1:3, 1:2, 1)
#> [1] "1 1 1" "2 2 1" "3 1 1"

We think it should be possible to implement a new paste() starting from

p_binary <- function(x, y = "") {
  stopifnot(length(x) == 1, length(y) == 1)
  paste0(x, y)
}
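From this building block, a reducing variant can be sketched with Reduce() (the name p_reduce() is our own; p_binary() is repeated so the snippet runs on its own):

```r
p_binary <- function(x, y = "") {
  stopifnot(length(x) == 1, length(y) == 1)
  paste0(x, y)
}

# reduce over the inputs, gluing sep between consecutive elements
p_reduce <- function(xs, sep = "") {
  Reduce(function(x, y) p_binary(p_binary(x, sep), y), xs)
}

p_reduce(c("a", "b", "c"), sep = "_")
#> [1] "a_b_c"
```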

The sep argument is equivalent to binding sep to every ... input supplied to paste() except the last, and then binding these results together. In relations:

paste(n1, n2, ..., nm, sep = sep) <=>
paste0(paste0(n1, sep), paste(n2, n3, ..., nm, sep = sep)) <=>
paste0(paste0(n1, sep), paste0(n2, sep), ..., paste0(nm-1, sep), nm)

We can check this for scalar and non-scalar input:

# scalar:
paste("a", "b", "c", sep = "_")
#> [1] "a_b_c"
paste0(paste0("a", "_"), paste("b", "c", sep = "_"))
#> [1] "a_b_c"
paste0(paste0("a", "_"), paste0("b", "_"), paste0("c"))
#> [1] "a_b_c"

# non-scalar
paste(1:2, "b", "c", sep = "_")
#> [1] "1_b_c" "2_b_c"
paste0(paste0(1:2, "_"), paste("b", "c", sep = "_"))
#> [1] "1_b_c" "2_b_c"
paste0(paste0(1:2, "_"), paste0("b", "_"), paste0("c"))
#> [1] "1_b_c" "2_b_c"

collapse just binds the outputs for non-scalar input together with the collapse string. In relations:

for inputs A1, ..., An, where Ai = a1i:ami,

paste(A1, A2, ..., An, collapse = collapse)
<=>
paste0(
  paste0(paste(a11,     a12,     ..., a1n    ), collapse),
  paste0(paste(a21,     a22,     ..., a2n    ), collapse),
  ...,
  paste0(paste(a(m-1)1, a(m-1)2, ..., a(m-1)n), collapse),
  paste(am1, am2, ..., amn)
)

One can see this from a few examples:

paste(1:5, 1:5, 6, sep = "", collapse = "_x_")
#> [1] "116_x_226_x_336_x_446_x_556"
paste(1,2,3,4, collapse = "_x_")
#> [1] "1 2 3 4"
paste(1:2,1:2,2:3,3:4, collapse = "_x_")
#> [1] "1 1 2 3_x_2 2 3 4"

We think the only paste() version that is not implemented in base R is an array version. At least we are not aware of something like row_paste() or paste_apply().
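Such an array version could be sketched by combining apply() with paste() (the name row_paste() is our own):

```r
# paste the entries of each matrix row together
row_paste <- function(m, sep = "") {
  apply(m, 1, paste, collapse = sep)
}

m <- matrix(letters[1:4], nrow = 2)
row_paste(m, sep = "_")
#> [1] "a_c" "b_d"
```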