20 Evaluation

20.1 Prerequisites

20.2 Evaluation Basics

  1. Q: Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?

    A: By default, source() uses the global environment (local = FALSE). With local = TRUE, the code is evaluated in the environment from which source() is called. A custom evaluation environment can be provided by passing it to the local argument.

  2. Q: Predict the results of the following lines of code:

    A: You can see the output of the code above here:

    #> [1] 4
    #> [1] 4
    #> eval(quote(eval(quote(eval(quote(2 + 2))))))

    Generally, 2 + 2 is first quoted as an expression and then evaluated to 4. When we nest calls to quote() and eval(), more calls are added to the AST, but the pattern of quoting and evaluating stays the same.

    When we wrap this expression in another eval(), the 4 is evaluated once more, but the result doesn’t change. When we quote it instead, no evaluation takes place and we capture the expression.
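
    The calls from the exercise are not reproduced in this section. The following is a reconstruction, written with quote() so that it matches the printed output; the exact wording of the original calls is an assumption.

```r
# Reconstructed calls (an assumption, chosen to match the printed output)
r1 <- eval(quote(eval(quote(eval(quote(2 + 2))))))
r2 <- eval(eval(quote(eval(quote(eval(quote(2 + 2)))))))
r3 <- quote(eval(quote(eval(quote(eval(quote(2 + 2)))))))

r1  # fully evaluated
r2  # one more eval() on the value 4 changes nothing
r3  # the outermost quote() prevents any evaluation
```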

  3. Q: Fill in the function bodies below to re-implement get() using sym() and eval(), and assign() using sym(), expr(), and eval(). Don’t worry about the multiple ways of choosing an environment that get() and assign() support; assume that the user supplies it explicitly.

    A: The reimplementation of these two functions using tidy evaluation is based on building an expression, which is then evaluated.

    The implementation could be even more concise if the user provided an expression instead of a string:

    To build the correct expression for the value assignment, we unquote using !!. Bang bang! :)
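
    A minimal sketch of both functions, assuming rlang is attached and the environment is always supplied explicitly. The names get2() and assign2() are chosen here for illustration.

```r
library(rlang)

# Turn the string into a symbol and evaluate it in `env`
get2 <- function(name, env) {
  name_sym <- sym(name)
  eval(name_sym, env)
}

# Build the call `name <- value` by unquoting, then evaluate it in `env`
assign2 <- function(name, value, env) {
  name_sym <- sym(name)
  assign_expr <- expr(!!name_sym <- !!value)
  eval(assign_expr, env)
}

e <- new.env()
assign2("x", 4, e)
get2("x", e)
```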

  4. Q: Modify source2() so it returns the result of every expression, not just the last one. Can you eliminate the for loop?

    A:

    • keep in mind <- invisibly returns its value
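
    One possible sketch, assuming rlang is attached: lapply() replaces the for loop and collects the value of every expression. Because <- returns its value invisibly, assignments contribute their assigned value to the result list.

```r
library(rlang)

# Evaluate every expression in a file and return all results as a list
source2 <- function(path, env = caller_env()) {
  file <- paste(readLines(path, warn = FALSE), collapse = "\n")
  exprs <- parse_exprs(file)
  # lapply() replaces the for loop; each element is one expression's value
  invisible(lapply(exprs, eval, env))
}

tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "y <- x + 1", "y * 2"), tmp)
res <- source2(tmp, env = new.env())
res
```
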
  5. Q: The code generated by source2() lacks source references. Read the source code for sys.source() and the help for srcfilecopy(), then modify source2() to preserve source references. You can test your code by sourcing a function that contains a comment. If successful, when you look at the function, you’ll see the comment and not just the source code.

    A:

    • the comment is still missing
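
    One way that may resolve this is to mirror what sys.source() does when source references are kept: create a srcfilecopy() object and pass it to parse() together with keep.source = TRUE. The following base R sketch is an untested assumption about how source2() could be adapted, not a confirmed solution.

```r
# Sketch modelled on sys.source(): parse with an explicit srcfile so that
# source references (including comments) are attached to the expressions
source2 <- function(path, env = parent.frame()) {
  lines <- readLines(path, warn = FALSE)
  srcfile <- srcfilecopy(path, lines, file.mtime(path), isFile = TRUE)
  exprs <- parse(text = lines, srcfile = srcfile, keep.source = TRUE)

  res <- NULL
  for (i in seq_along(exprs)) {
    res <- eval(exprs[[i]], env)
  }
  invisible(res)
}

tmp <- tempfile(fileext = ".R")
writeLines(c("f <- function(x) {", "  # a comment", "  x + 1", "}"), tmp)
e <- new.env()
source2(tmp, env = e)

# The stored source of f, as kept in its srcref attribute
src <- as.character(attr(e$f, "srcref"))
src
```
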
  6. Q: We can make base::local() slightly easier to understand by spreading it out over multiple lines:

    Explain how local() works in words. (Hint: you might want to print(call) to help understand what substitute() is doing, and read the documentation to remind yourself what environment new.env() will inherit from.)

    A:

```r
local3 <- function(expr, envir = new.env()) {
  call <- substitute(eval(quote(expr), envir))
  print(call)
  eval(call, envir = parent.frame())
}

foo <- local3({
  x <- 10
  x * 2
})
#> eval(quote({
#>     x <- 10
#>     x * 2
#> }), new.env())

foo
#> [1] 20
```

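substitute() operates in the function’s execution environment and replaces the arguments bound there with their expressions: expr becomes the code block supplied by the user, and envir becomes the expression for the environment passed to local3() (new.env() by default). The printed call shows the result.

This call is then evaluated in the caller environment. There, new.env() creates a fresh environment that inherits from the caller environment, and the code block is evaluated inside this new environment. Variables created by the block therefore stay local, while the value of the block is returned to the caller.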

```r
# Roughly what happens afterwards (here called from the top level)
eval(eval(quote({x <- 10; x * 2}), new.env()), parent.frame())
#> [1] 20

# The inner call evaluates the block in a fresh environment, so x stays local
eval(quote({x <- 10; x * 2}), new.env())
#> [1] 20

# Evaluating the resulting constant in the caller environment just returns it
eval(20, parent.frame())
#> [1] 20
```
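
The following sketch illustrates both properties described above: the block can see variables of the caller, but variables created inside the block do not leak out. The names outer_value and inner_value are hypothetical and only serve the illustration.

```r
local3 <- function(expr, envir = new.env()) {
  call <- substitute(eval(quote(expr), envir))
  eval(call, envir = parent.frame())
}

outer_value <- 5
res <- local3({
  inner_value <- 10
  inner_value * outer_value  # outer_value is found via the parent environment
})

res
# inner_value was created in the temporary environment only
leaked <- exists("inner_value")
leaked
```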

20.3 Quosures

  1. Q: Predict what evaluating each of the following quosures will return.

    A: Each quosure is evaluated in its own environment, so every x is looked up in the environment captured by the quosure that contains it.
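
    The quosures from the exercise are not reproduced in this section. As an illustration, assume three nested quosures of the following form (q1, q2 and q3 are assumed names), each carrying its own binding for x:

```r
library(rlang)

# Each quosure pairs an expression with its own environment
q1 <- new_quosure(expr(x), env(x = 1))
q2 <- new_quosure(expr(x + !!q1), env(x = 10))
q3 <- new_quosure(expr(x + !!q2), env(x = 100))

v1 <- eval_tidy(q1)  # x from q1's environment
v2 <- eval_tidy(q2)  # 10 + value of q1
v3 <- eval_tidy(q3)  # 100 + value of q2
c(v1, v2, v3)
```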

  2. Q: Write an enenv() function that captures the environment associated with an argument. (Hint: this should only require two function calls.)

    A: A quosure captures both the expression and the environment. From a captured quosure, we can access the environment with the help of get_env().
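
    A minimal sketch of this idea, assuming rlang is attached. The helper f() is hypothetical and only demonstrates that the captured environment is the one in which the argument was supplied.

```r
library(rlang)

# Capture the argument as a quosure, then extract its environment
enenv <- function(x) {
  get_env(enquo(x))
}

# Hypothetical helper: the argument is supplied from f()'s execution environment
f <- function() {
  x <- 1
  list(captured = enenv(x), current = current_env())
}

res <- f()
same_env <- identical(res$captured, res$current)
same_env
```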

20.4 Data Masks

  1. Q: Why did I use a for loop in transform2() instead of map()? Consider transform2(df, x = x * 2, x = x * 2).

    A: Within the for loop, the result of each previous step has already been assigned to .data, which makes chained transformations possible.

    For transform2(df, x = x * 2, x = x * 2), the computation (x * 2) * 2 takes place: the output of the first transformation is used as input for the second. This is not possible with map(), because it evaluates each element of the input separately. The individual transformations would be independent of one another, and intermediate results would not be available to subsequent transformations.

    The repeated retransformation of columns is very useful in interactive data analysis and can also be used within dplyr::mutate(). Be a little careful with reusing the same name too often though, because this spreads out related transformations across multiple statements.
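
    The definition of transform2() is not shown in this section. The sketch below assumes a for-loop implementation along the lines described above and applies it to a small, hypothetical data frame.

```r
library(rlang)

# Assumed implementation: each result is written back to .data before
# the next expression is evaluated
transform2 <- function(.data, ...) {
  dots <- enquos(...)

  for (i in seq_along(dots)) {
    name <- names(dots)[[i]]
    dot <- dots[[i]]
    .data[[name]] <- eval_tidy(dot, .data)
  }

  .data
}

df <- data.frame(x = 1:3)
out <- transform2(df, x = x * 2, x = x * 2)
out$x  # the second step sees the already doubled column
```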

  2. Q: Here’s an alternative implementation of subset2():

    Compare and contrast subset3() to subset2(). What are its advantages and disadvantages?

    A: Let’s take a closer look at subset2() first:

    We see that subset2() contains an additional logical check, which is missing from subset3(). Here, the logical condition rows is evaluated in the context of data, which results in a logical vector used for subsetting. Afterwards, only [ needs to be used to return the subset.

    With subset3(), both of these steps occur in a single line. This means that the subsetting is also evaluated in the context of the data mask.

    This is shorter, but also less readable, because the evaluation and the subsetting take place in the same expression. It may also introduce unwanted errors if the data mask contains an element named data, because the object from the data mask takes precedence over the argument of the function.
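
    Neither function is shown in this section. The sketch below assumes implementations that match the description above and demonstrates the name clash with a hypothetical data frame that has a column called data.

```r
library(rlang)

# Assumed implementation: evaluate first, check, then subset
subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

# Assumed implementation: evaluation and subsetting in one expression
subset3 <- function(data, rows) {
  rows <- enquo(rows)
  eval_tidy(expr(data[!!rows, , drop = FALSE]), data = data)
}

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
res2 <- subset2(df, x >= 2)
res3 <- subset3(df, x >= 2)

# A column called `data` masks the function argument inside subset3()
df_clash <- data.frame(x = 1:3, data = c("a", "b", "c"))
clash <- try(subset3(df_clash, x >= 2), silent = TRUE)
ok <- subset2(df_clash, x >= 2)
```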

  3. Q: The following function implements the basics of dplyr::arrange(). Annotate each line with a comment explaining what it does. Can you explain why !!.na.last is strictly correct, but omitting the !! is unlikely to cause problems?

    A: This function builds an expression that contains the specified order() call. The !!! operator splices in multiple arguments (used to break ties). Once the correct row order is determined, numeric subsetting is used to return the rearranged data frame.

    By using !!.na.last, the .na.last argument is unquoted when the order() call is built. That way, the na.last argument is already correctly specified (typically TRUE, FALSE or NA).

    Without the unquoting, the expression would read na.last = .na.last, and the value of .na.last would still have to be looked up when the call is evaluated. Because the evaluation falls back on the function’s execution environment (which contains .na.last), this is unlikely to cause problems. It would only go wrong if the data frame contained a column named .na.last, which would then mask the argument.

    PS: Putting breakpoints (browser()) inside these functions was really helpful for figuring out what’s going on inside of them.
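
    The function from the exercise is not shown in this section. An annotated sketch, assuming an implementation that matches the description above (the name arrange2() and the example data are assumptions):

```r
library(rlang)

arrange2 <- function(.df, ..., .na.last = TRUE) {
  # Capture all ordering expressions as quosures
  args <- enquos(...)

  # Build the order() call: splice in the quosures, unquote na.last
  order_call <- expr(order(!!!args, na.last = !!.na.last))

  # Evaluate the call with .df as the data mask
  ord <- eval_tidy(order_call, .df)
  # Each row must appear exactly once in the ordering
  stopifnot(length(ord) == nrow(.df))

  # Rearrange the rows by numeric subsetting
  .df[ord, , drop = FALSE]
}

df <- data.frame(x = c(3, 1, 2), y = c("c", "a", "b"))
sorted <- arrange2(df, x)
sorted
```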

20.5 Using tidy evaluation

  1. Q: I’ve included an alternative implementation of threshold_var() below. What makes it different to the approach I used above? What makes it harder?

    A: Let’s compare this approach to the original implementation:

    We can see that the symbol is no longer coerced to a string in threshold_var2(). Therefore, $ instead of [[ is used for subsetting. Initially we suspected partial matching to work with $, but this seems to be avoided when the expression is tidily evaluated.

    The prefix call to $() is less common than infix subsetting using [[, but ultimately both functions seem to behave the same.
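
    Both implementations are missing from this section. The sketch below assumes definitions that match the description above, together with a subset2() helper, and compares them on a hypothetical data frame.

```r
library(rlang)

# Assumed helper, as used elsewhere in this chapter
subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

# Variant 1: coerce the symbol to a string and subset with [[
threshold_var <- function(df, var, val) {
  var <- as_string(ensym(var))
  subset2(df, .data[[var]] >= !!val)
}

# Variant 2: keep the symbol and unquote it into a prefix call to $
threshold_var2 <- function(df, var, val) {
  var <- ensym(var)
  subset2(df, `$`(.data, !!var) >= !!val)
}

df <- data.frame(x = 1:10)
res1 <- threshold_var(df, x, 8)
res2 <- threshold_var2(df, x, 8)
```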

20.6 Base evaluation

  1. Q: Why does this function fail?

    A: In this function, lm_call is evaluated in the caller environment, which happens to be the global environment. Looked up from this environment, the name data resolves to the function utils::data(). To fix the error, we can either set the evaluation environment to the function’s execution environment or unquote the data argument when building the call to lm().

    When we want to unquote an argument within a function, we first need to capture the user input (with enexpr()).
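
    The failing function is not shown in this section. The sketch below assumes it looks like lm3a() and illustrates both fixes; the names lm3a(), lm3b() and lm3c() are chosen for illustration.

```r
library(rlang)

# Assumed failing version: `data` is looked up from the caller environment
lm3a <- function(formula, data) {
  formula <- enexpr(formula)
  lm_call <- expr(lm(!!formula, data = data))
  eval(lm_call, caller_env())
}

# Fix 1: evaluate in the function's execution environment
lm3b <- function(formula, data) {
  formula <- enexpr(formula)
  lm_call <- expr(lm(!!formula, data = data))
  eval(lm_call, current_env())
}

# Fix 2: capture the data argument and unquote it into the call
lm3c <- function(formula, data) {
  formula <- enexpr(formula)
  data_expr <- enexpr(data)
  lm_call <- expr(lm(!!formula, data = !!data_expr))
  eval(lm_call, caller_env())
}

fail <- try(lm3a(mpg ~ disp, mtcars), silent = TRUE)
fit_b <- lm3b(mpg ~ disp, mtcars)
fit_c <- lm3c(mpg ~ disp, mtcars)
fit_c$call
```

    The second fix has the side benefit that the recorded model call refers to the dataset the user actually supplied.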

  2. Q: When model building, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in the code below.

    A: In the wrapping function below, the response and the data are defined as default arguments of the function. It would also be acceptable to “hardcode” them into the lm() expression instead, but this way provides a little more flexibility.

    In practice, small wrappers like this can help keep your scripts well organized and make it easy to see what is being changed.
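
    The wrapper itself is missing from this section. A possible sketch, assuming the models use mpg as response and mtcars as data, and that only the predictors vary (the name lm_wrap() is an assumption):

```r
library(rlang)

# Response and data are defaults; only the predictors need to be supplied
lm_wrap <- function(pred, resp = mpg, data = mtcars, env = caller_env()) {
  pred <- enexpr(pred)
  resp <- enexpr(resp)
  data <- enexpr(data)

  # Assemble the formula and the call from the captured expressions
  formula <- expr(!!resp ~ !!pred)
  lm_call <- expr(lm(!!formula, data = !!data))
  eval(lm_call, envir = env)
}

fit1 <- lm_wrap(disp)
fit2 <- lm_wrap(I(1 / disp))
fit3 <- lm_wrap(disp * cyl)
fit3$call
```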

  3. Q: Another way to write resample_lm() would be to include the resample expression (data[sample(nrow(data), replace = TRUE), , drop = FALSE]) in the data argument. Implement that approach. What are the advantages? What are the disadvantages?

    A: We can take advantage of the lazy evaluation of function arguments by moving the resampling step into the argument definition. The user passes the data to the function, but only a bootstrap resample of this data (rsampled_data) will be used.

    With this approach, the evaluation needs to take place within the function’s environment, because the resampled dataset (defined as a default argument) is only available there.

    Overall, putting an essential part of the preprocessing outside of the function’s body is not common practice in R. Compared to the unquoting implementation (resample_lm1()), this approach captures the model call in a more meaningful way.
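
    The implementation is not shown in this section. A sketch under the assumptions described above, with the resampling expression as the default of an extra argument (the function name resample_lm2() is an assumption):

```r
library(rlang)

# The resampling lives in a default argument and is evaluated lazily,
# so the model must be fitted in the function's own environment
resample_lm2 <- function(
  formula, data,
  rsampled_data = data[sample(nrow(data), replace = TRUE), , drop = FALSE],
  env = current_env()
) {
  formula <- enexpr(formula)
  lm_call <- expr(lm(!!formula, data = rsampled_data))
  eval(lm_call, env)
}

set.seed(1)
fit <- resample_lm2(mpg ~ disp, mtcars)
fit$call
```

    The recorded call refers to rsampled_data rather than to the resampling expression itself, which is what keeps the model call readable.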

20.7 Old exercises

  1. Q: Run this code in your head and predict what it will print. Confirm or refute your prediction by running the code in R.

  2. Q: What happens if you use expr() instead of enexpr() inside of subset2()?

  3. Q: Improve subset2() to make it more like real base::subset():

    • Drop rows where subset evaluates to NA
    • Give a clear error message if subset doesn’t yield a logical vector
    • What happens if subset yields a vector whose length is not the same as the number of rows in data? What do you think should happen?
  4. Q: The third argument in base::subset() allows you to select variables. It treats variable names as if they were positions. This allows you to do things like subset(mtcars, , -cyl) to drop the cylinder variable, or subset(mtcars, , disp:drat) to select all the variables between disp and drat. How does this work? I’ve made this easier to understand by extracting it out into its own function that uses tidy evaluation.

  5. Q: Here’s an alternative implementation of arrange():

    Describe the primary difference in approach compared to the function defined in the text.

    One advantage of this approach is that you could check each element of ... to make sure that input is correct. What property should each element of ords have?

  6. Q: Here’s an alternative implementation of subset2():

    Use intermediate variables to make the function easier to understand, then explain how this approach differs to the approach in the text.

  7. Q: Implement a form of arrange() where you can request a variable to be sorted in descending order using named arguments:

    (Hint: The decreasing argument to order() will not help you. Instead, look at the definition of dplyr::desc(), and read the help for xtfrm().)

  8. Q: Why do you not need to worry about ambiguous argument names with ... in arrange()? Why is it a good idea to use the . prefix anyway?

  9. Q: What does transform() do? Read the documentation. How does it work? Read the source code for transform.data.frame(). What does substitute(list(...)) do?

  10. Q: Use tidy evaluation to implement your own version of transform(). Extend it so that a calculation can refer to variables created by transform, i.e. make this work:

  11. Q: What does with() do? How does it work? Read the source code for with.default(). What does within() do? How does it work? Read the source code for within.data.frame(). Why is the code so much more complex than with()?

  12. Q: Implement a version of within.data.frame() that uses tidy evaluation. Read the documentation and make sure that you understand what within() does, then read the source code.

  1. Q: When model building, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in this situation.

  2. Q: Another way to write boot_lm() would be to include the bootstrapping expression (data[sample(nrow(data), replace = TRUE), , drop = FALSE]) in the data argument. Implement that approach. What are the advantages? What are the disadvantages?

  3. Q: To make these functions somewhat more robust, instead of always using the caller_env() we could capture a quosure, and then use its environment. However, if there are multiple arguments, they might be associated with different environments. Write a function that takes a list of quosures, and returns the common environment, if they have one, or otherwise throws an error.

  4. Q: Write a function that takes a data frame and a list of formulas, fitting a linear model with each formula, generating a useful model call.

  5. Q: Create a formula generation function that allows you to optionally supply a transformation function (e.g. log()) to the response or the predictors.