# 10 Function factories

## 10.1 Prerequisites

For most of this chapter base R is sufficient. Just a few exercises require **rlang** and **ggplot2** packages.

```
```r
library(rlang)
library(ggplot2)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
```
```

## 10.2 Factory fundamentals

**Q**: The definition of`force()`

is simple:`force #> function (x) #> x #> <bytecode: 0xf04508> #> <environment: namespace:base>`

Why is it better to

`force(x)`

instead of just`x`

?**A**: As you can see`force(x)`

is just syntactic sugar for`x`

. We prefer this explicit form, becauseusing this function clearly indicates that you’re forcing evaluation, not that you’ve accidentally typed

`x`

. (Quote from the textbook)**Q**: Base R contains two function factories,`approxfun()`

and`ecdf()`

. Read their documentation and experiment to figure out what the functions do and what they return.**A**: Let’s begin with`approxfun()`

as it is used within`ecdf()`

also:`approxfun()`

takes a 2-dimensional combination of data points (`x`

and`y`

) as input and returns a*stepwise interpolation function*, which transforms new`x`

values. Additional arguments control how the created function should behave. (The interpolation`method`

may be linear or constant.`yleft`

,`yright`

and`rule`

specify how the newly created function should map new values which are outside of`range(x)`

.`f`

controls the degree of right-left-continuity via a numeric value from`0`

to`1`

and`ties`

expects a function name like min, mean etc. which defines how non-unique x-y-combinations should be handled when interpolating the data points.)`ecdf()`

is an acronym for empirical cumulative distribution function. For a numeric vector,`ecdf()`

returns the appropriate distribution function (of class “ecdf”, which is inheriting from class “stepfun”). Initially the (x, y) pairs for the nodes of the density function are calculated. Afterwards these pairs are passed to`approxfun()`

, which then returns the desired function.

**Q**: Create a function`pick()`

that takes an index,`i`

, as an argument and returns a function with an argument`x`

that subsets`x`

with`i`

.`pick(1)(x) # should be equivalent to x[[1]] lapply(mtcars, pick(5)) # should be equivalent to lapply(mtcars, function(x) x[[5]])`

**A**: In this exercise`pick(i)`

acts as a function factory, which returns the required subsetting function.`pick <- function(i) { force(i) function(x) x[[i]] } x <- 1:3 identical(x[[1]], pick(1)(x)) #> [1] TRUE identical(lapply(mtcars, function(x) x[[5]]), lapply(mtcars, pick(5))) #> [1] TRUE`

**Q**: Create a function that creates functions that compute the i^{th}central moment of a numeric vector. You can test it by running the following code:`m1 <- moment(1) m2 <- moment(2) x <- runif(100) stopifnot(all.equal(m1(x), 0)) stopifnot(all.equal(m2(x), var(x) * 99 / 100))`

**A**: The first moment is closely related to the mean and describes the average deviation from the mean, which is 0 (within numerical margin of error). The second moment describes the variance of the input data. If we want compare it to`var`

, we need to undo [Bessel’s correction}(https://en.wikipedia.org/wiki/Bessel%27s_correction) correction by multiplying with \(\frac{N-1}{N}\).`moment <- function(i){ force(i) function(x) sum((x - mean(x)) ^ i) / length(x) } m1 <- moment(1) m2 <- moment(2) x <- runif(100) all.equal(m1(x), 0) # removed stopifnot() for clarity #> [1] TRUE all.equal(m2(x), var(x) * 99 / 100) #> [1] TRUE`

**Q**: What happens if you don’t use a closure? Make predictions, then verify with the code below.`i <- 0 new_counter2 <- function() { i <<- i + 1 i }`

**A**: Without the captured and encapsulated environment of a closure the counts will be stored in the global environment. Here they can be overwritten or deleted as well as interfere with other counters.`new_counter2() #> [1] 1 new_counter2() #> [1] 2 i <- 0 new_counter2() #> [1] 1`

**Q**: What happens if you use`<-`

instead of`<<-`

? Make predictions, then verify with the code below.`new_counter3 <- function() { i <- 0 function() { i <- i + 1 i } }`

**A**: Without the super assignment`<<-`

, the counter will always return 1. The counter always starts in a new execution environment within the same enclosing environment, which contains an unchanged value for`i`

(in this case it remains 0).`new_counter_3 <- new_counter3() new_counter_3() #> [1] 1 new_counter_3() #> [1] 1`

## 10.3 Graphical factories

**Q**: Compare and contrast`ggplot2::label_bquote()`

with`scales::number_format()`

.**A**:

## 10.4 Statistical factories

**Q**: In`boot_model()`

, why don’t I need to force the evaluation of`df`

or`model`

?**A**:`boot_model()`

ultimately returns a function.`boot_model <- function(df, formula) { mod <- lm(formula, data = df) fitted <- unname(fitted(mod)) resid <- unname(resid(mod)) rm(mod) function() { fitted + sample(resid) } }`

In this function neither

`df`

nor`model`

are used. The relevant values,`fitted`

and`resid`

, have been calculated (and therefore evaluated) in the enclosing environment of`boot_model`

.**Q**: Why might you formulate the Box-Cox transformation like this?`boxcox3 <- function(x) { function(lambda) { if (lambda == 0) { log(x) } else { (x ^ lambda - 1) / lambda } } }`

**A**:`boxcox3`

returns a function where`x`

is fixed (though it is not forced, so it may manipulated later). This allows us to apply and test different transformations for different inputs and give them a descriptive name.`library(purrr) #> #> Attaching package: 'purrr' #> The following objects are masked from 'package:rlang': #> #> %@%, as_function, flatten, flatten_chr, flatten_dbl, #> flatten_int, flatten_lgl, flatten_raw, invoke, list_along, #> modify, prepend, splice # initial example (should be improved) boxcox_airpassengers <- boxcox3(AirPassengers) par(mfrow=c(2, 2)) purrr::map(c(0.1, 0.3, 0.5, 0.8), boxcox_airpassengers) %>% purrr::walk(plot) dev.off() #> null device #> 1`

**Q**: Why don’t you need to worry that`boot_permute()`

stores a copy of the data inside the function that it generates?**A**: The objects created in a manufactured function, like those returned by`boot_compute()`

, are created in the function’s execution environment. Execution environments are ephemeral: once the function is executed, the environment will be garbage collected. As long as we don’t return the execution environment and assign the referencing object, we won’t run into problems.

**Q**: How much time does`ll_poisson2()`

save compared to`ll_poisson1()`

? Use`bench::mark()`

to see how much faster the optimisation occurs. How does changing the length of`x`

change the results?**A**: Let us recall the definitions of`ll_poisson1()`

and`ll_poisson2()`

and the test data`x1`

:`ll_poisson1 <- function(x) { n <- length(x) function(lambda) { log(lambda) * sum(x) - n * lambda - sum(lfactorial(x)) } } ll_poisson2 <- function(x) { n <- length(x) sum_x <- sum(x) c <- sum(lfactorial(x)) function(lambda) { log(lambda) * sum_x - n * lambda - c } } # provided test data x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)`

A benchmark with this data reveals a performance improvement of factor 2 for

`ll_poisson2()`

over`ll_poisson1()`

`bench::mark(llp1 = optimise(ll_poisson1(x1), c(0, 100), maximum = TRUE), llp2 = optimise(ll_poisson2(x1), c(0, 100), maximum = TRUE)) #> # A tibble: 2 x 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 llp1 31.6µs 37.4µs 25453. 12.8KB 20.4 #> 2 llp2 14.9µs 17.9µs 52251. 0B 20.9`

Regarding differing lengths of

`x1`

, we expect even further performance improvements of`ll_poisson2()`

compared to`ll_poisson1()`

, as the redundant calculations within`ll_poisson1()`

, become more expensive with growing length of`x1`

. The following results imply that for a length of`x1`

of 100000,`ll_poisson2()`

is about 20+ times as fast as`ll_poisson1()`

:`library(purrr) library(dplyr) bench_poisson <- function(i){ x_i_length <- 10L ^ i x_i <- rpois(x_i_length, 100L) rel_advantage <- bench::mark(llp1 = optimise(ll_poisson1(x_i), c(0, 100), maximum = TRUE), llp2 = optimise(ll_poisson2(x_i), c(0, 100), maximum = TRUE), relative = TRUE)$median %>% max() rel_advantage } bench_df <- map_dbl(1:5, bench_poisson) %>% tibble(i = 1:5, rel_advantage = ., x_length = 10 ^ i) bench_df %>% ggplot(aes(x_length, rel_advantage)) + geom_point() + geom_line() + ggtitle("Rel. Speed of ll_poisson2() increases with vector length") + scale_x_log10()`

## 10.5 Function factories + functionals

**Q**: Which of the following commands is equivalent to`with(x, f(z))`

?`x$f(x$z)`

.`f(x$z)`

.`x$f(z)`

.`f(z)`

.- It depends.

**A**: (e) “It depends” is the correct answer. Usually`with()`

is used with a data frame, so you’d usually expect (b), but if`x`

is a list, it could be any of the options.`f <- mean z <- 1 x <- list(f = mean, z = 1) identical(with(x, f(z)), x$f(x$z)) #> [1] TRUE identical(with(x, f(z)), f(x$z)) #> [1] TRUE identical(with(x, f(z)), x$f(z)) #> [1] TRUE identical(with(x, f(z)), f(z)) #> [1] TRUE`

**Q**: Compare and contrast the effects of`env_bind()`

vs.`attach()`

for the following code.`funs <- list( mean = function(x) mean(x, na.rm = TRUE), sum = function(x) sum(x, na.rm = TRUE) ) attach(funs) #> The following objects are masked from package:base: #> #> mean, sum mean <- function(x) stop("Hi!") detach(funs) env_bind(globalenv(), !!!funs) mean <- function(x) stop("Hi!") env_unbind(globalenv(), names(funs))`

**A**:`attach()`

adds`funs`

to the search path. Therefore, the provided functions are found before their respective versions from the base package. Further, they can not get accidently overwritten by similar named functions in the global environment. One annoying downsinde of using`attach()`

is the possibility to attach the same object multiple times, making it necessary to call`detach()`

equally often.In contrast

`rlang::env_bind()`

just adds the functions in`fun`

to the global environment. No further side effects are introduced and the functions are overwritten when similarly named functions are defined.

## 10.6 Old exercises

### 10.6.1 Closures

**Q**: Why are functions created by other functions called closures?

**A**: As stated in the book:

`> because they enclose the environment of the parent function and can access all its variables.`

**Q**: What does the following statistical function do? What would be a better name for it? (The existing name is a bit of a hint.)`bc <- function(lambda) { if (lambda == 0) { function(x) log(x) } else { function(x) (x ^ lambda - 1) / lambda } }`

**A**: It is the logarithm, when lambda equals zero and`x ^ lambda - 1 / lambda`

otherwise. A better name might be`box_cox_transformation`

(one parametric), you can read about it (here)[https://en.wikipedia.org/wiki/Power_transform].**Q**: What does`approxfun()`

do? What does it return?

**A**:`approxfun`

basically takes a combination of 2-dimensional data points + some extra specifications as arguments and returns a stepwise linear or constant interpolation function (defined on the range of given x-values, by default).**Q**: What does`ecdf()`

do? What does it return?

**A**: “ecdf” means empirical density function. For a numeric vector,`ecdf()`

returns the appropriate density function (of class “ecdf”, which is inheriting from class “stepfun”). You can describe it’s behaviour in 2 steps. In the first part of it’s body, the`(x,y)`

pairs for the nodes of the density function are calculated. In the second part these pairs are given to`approxfun`

.**Q**: Create a function that creates functions that compute the ith central moment of a numeric vector. You can test it by running the following code:`m1 <- moment(1) m2 <- moment(2) x <- runif(100) stopifnot(all.equal(m1(x), 0)) stopifnot(all.equal(m2(x), var(x) * 99 / 100))`

**A**: For a discrete formulation look here`moment <- function(i){ function(x) sum((x - mean(x)) ^ i) / length(x) }`

**Q**: Create a function`pick()`

that takes an index,`i`

, as an argument and returns a function with an argument`x`

that subsets`x`

with`i`

.`lapply(mtcars, pick(5)) # should do the same as this lapply(mtcars, function(x) x[[5]])`

**A**:`pick <- function(i){ function(x) x[[i]] } stopifnot(identical(lapply(mtcars, pick(5)), lapply(mtcars, function(x) x[[5]])) )`

### 10.6.2 Case study: numerical integration

**Q**: Instead of creating individual functions (e.g.,`midpoint()`

,`trapezoid()`

,`simpson()`

, etc.), we could store them in a list. If we did that, how would that change the code? Can you create the list of functions from a list of coefficients for the Newton-Cotes formulae?

**A**:**Q**: The trade-off between integration rules is that more complex rules are slower to compute, but need fewer pieces. For`sin()`

in the range [0, \(\pi\)], determine the number of pieces needed so that each rule will be equally accurate. Illustrate your results with a graph. How do they change for different functions?`sin(1 / x^2)`

is particularly challenging.

**A**: