31 Non standard evaluation

31.1 Capturing expressions

  1. Q: One important feature of deparse() to be aware of when programming is that it can return multiple strings if the input is too long. For example, the following call produces a vector of length two:

    Why does this happen? Carefully read the documentation for ?deparse. Can you write a wrapper around deparse() so that it always returns a single string?

    A: deparse() has a width.cutoff argument (default 60 byte), which is according to ?deparse an:

    integer in [20, 500] determining the cutoff (in bytes) at which line-breaking is tried.

    Further:

    width.cutoff is a lower bound for the line lengths: deparsing a line proceeds until at least width.cutoff bytes have been output and e.g. arg = value expressions will not be split across lines.

    You can wrap it with for example with paste0():

    It can be a little bit enhanced with a gsub():

    This formats at least the spaces to a unified single space. However note that it is not possible to capture the exact input in every case:

  2. Q: Why does as.Date.default() use substitute() and deparse()? Why does pairwise.t.test() use them? Read the source code.

    A: as.Date.default() uses them to convert unexpected input expressions (neither dates, nor NAs) into a character string and return it within an error message.

    pairwise.t.test() uses them to convert the names of its datainputs (response vector x and grouping factor g) into character strings to format these further into a part of the desired output.

  3. Q: pairwise.t.test() assumes that deparse() always returns a length one character vector. Can you construct an input that violates this expectation? What happens?

    A: We can pass an expression to one of pairwise.t.test()’s data input arguments, which exceeds the default cutoff width in deparse(). The expression will be split into a character vector of length greater 1. The deparsed data inputs are directly pasted (read the source code!) with “and” as separator and the result is just used to be displayed in the output. Just the data.name output will change (it will include more than one “and”).

  4. Q: f(), defined above, just calls substitute(). Why can’t we use it to define g()? In other words, what will the following code return? First make a prediction. Then run the code and think about the results.

    A: All return x, because substitute()’s second argument env is the current evaluation environment environment(). If you call substitute from another function, you may want to set the env argument to parent.frame(), which refers to the calling environment:

31.2 Non standard evaluation in subset

  1. Q: Predict the results of the following lines of code:

    A: An outside quote() always wins…

  2. Q: subset2() has a bug if you use it with a single column data frame. What should the following code return? How can you modify subset2() so it returns the correct type of object?

    A: Well what does base::subset return?

    So we want that the output is always a data frame and not an atomic vector like above. To return always a data frame change the last row in subset2() to x[r, , drop = FALSE].

  3. Q: The real subset function (subset.data.frame()) removes missing values in the condition. Modify subset2() to do the same: drop the offending rows.

    A: This time change the last row to x[!is.na(r) & r, , drop = FALSE]. Alternatively you can also exclude NAs from the subset via setting them to FALSE with r[is.na(r)] <- FALSE.

  4. Q: What happens if you use quote() instead of substitute() inside of subset2()?

    A: R looks for condition within sample_df but can’t find it, so it is looking in the execution environment for condition and evaluates it to a >= 4 (as supplied in the input). In the actual environment and the remaining environments (the global environment and the search path) a can’t be found and we get the error “Error in eval(expr, envir, enclos) : object ‘a’ not found”. To understand this in detail, it is very important to forget about substitute() for a moment and just explore where eval() evaluates its supplied expressions for all kind of supplied envir and enclos arguments. Before you get crazy (since a lot of stuff is coming togetehr here), look also here and here.

    The above is opposed to substitute(), which isn’t only capturing the symbol condition, but the expression slot of the condition promise object, which means, that substitute() notices, when a promise is assigned as it’s first argument and also stores this information. To be more precise, we quote from R Language Definition

    A formal argument is really a promise, an object with three slots, one for the expression that defines it, one for the environment in which to evaluate that expression, and one for the value of that expression once evaluated. substitute will recognize a promise variable and substitute the value of its expression slot.

  5. Q: The second argument in subset() allows you to select variables. It treats variable names as if they were positions. This allows you to do things like subset(mtcars, , -cyl) to drop the cylinder variable, or subset(mtcars, , disp:drat) to select all the variables between disp and drat. How does this work? I’ve made this easier to understand by extracting it out into its own function.

    A: We can comment what happens

    This works also for ranges, i.e.,

    because of the usual precedences cyl:drat becomes 2:5.

  6. Q: What does evalq() do? Use it to reduce the amount of typing for the examples above that use both eval() and quote().

    A: From the help of eval():

    The evalq form is equivalent to eval(quote(expr), …). eval evaluates its first argument in the current scope before passing it to the evaluator: evalq avoids this.

    In other “words”:

    The examples above can be written as:

31.3 Scoping issues

  1. Q: plyr::arrange() works similarly to subset(), but instead of selecting rows, it reorders them. How does it work? What does substitute(order(...)) do? Create a function that does only that and experiment with it.

    A: substitute(order(...)) orders the indices of the supplied columns in ... in the context of the submitted data.frame argument, beginning with the first submitted column.

    We can just copy the part of the source code from plyr::arrange() and see if it does what we expect:

  2. Q: What does transform() do? Read the documentation. How does it work? Read the source code for transform.data.frame(). What does substitute(list(...)) do?

    A: As stated in the next question transform() is similar to plyr::mutate() but plyr::mutate() applies the transformations sequentially so that transformation can refer to columns that were just created. The rest of the question can be answered, by just commenting the source code:

    # Setting "..." as function argument allows the user to specify any kind of extra 
    # argument to the function. In this case we can expect arguments of the form 
    # new_col1 = foo(col_in_data_argument), new_col2 = foo(col_in_data_argument),... 
    > transform.data.frame
    function (`_data`, ...) 
    {
      # subsitute(list(...)) takes the dots into a list and just returns the expression
      # `list(...)`. Nothing is evaluated until now (which is important). 
      # Evaluation of the expression happens with the `eval()` function.
      # This means: all the names of the arguments in `...` like new_col1, new_col2,...
      # become names of the list `e`.
      # All functions/variables like foo(column_in_data_argument) are evaluated within
      # the context (environment) of the `_data` argument supplied to the `transform()` 
      # function (this is specified by the second argument of the eval() function).
      e <- eval(substitute(list(...)), `_data`, parent.frame())
    
      # Everything that happens from now on is just about formatting and
      # returning the correct columns:
      # We save the names of the list (the names of the added columns)
      tags <- names(e)
      # We create a numeric vector and check if the tags (names of the added columns) 
      # appear in the names of the supplied `_data` argument. If yes, we save the 
      # column number, if not we save an NA.
      inx <- match(tags, names(`_data`))
      # We create a logical vector, which is telling us if a column_name is already in the
      # data.frame (TRUE) or really new (FALSE)
      matched <- !is.na(inx)
      # If any new column is corresponding to an old column name,
      # the correspong old columns will be overwritten
      if (any(matched)) {
        `_data`[inx[matched]] <- e[matched]
        `_data` <- data.frame(`_data`)
      }
      # If there is at least one new column name, all of these new columns will be bound
      # on the old data.frame (which might have changed a bit during the first if). Then the
      # transformed `data_` is returned
      if (!all(matched)) 
        do.call("data.frame", c(list(`_data`), e[!matched]))
      # Also in case of no new column names the transformed `data_` is returned
      else `_data`
    }
  3. Q: plyr::mutate() is similar to transform() but it applies the transformations sequentially so that transformation can refer to columns that were just created:

    How does mutate work? What’s the key difference between mutate() and transform()?

    A: The main difference is the possibility of sequential transformations. Another difference is that unnamed added columns will be thrown away. For the implementation many ideas are are the same. However the key difference is that for the sequential transformations, a for loop is created which iterates over a list of expressions and simultaneously changes the environment for the evaluation of the next expression (which is the supplied data). This should become clear with some comments on the code:

  4. Q: What does with() do? How does it work? Read the source code for with.default(). What does within() do? How does it work? Read the source code for within.data.frame(). Why is the code so much more complex than with()?

    A: with() is a generic function that allows writing an expression (second argument) that refers to variablenames of data (first argument) as if the corresponding variables were objects themselves.

    with() evaluates the expression via an

    construct in a temporary environment, which has the calling frame as a parent. This also means that variables that aren’t found in data, will be looked up in with()’s calling environment. As stated in ?with, this is useful for modelling functions.

    In contrast to with(), which returns the value of the evaluated expression, within() returns the modified object. So within() can be used as an alternative to base::transform(). within() first creates an environment with data as parent and within()’s calling environment as grandparent. This environment becomes changed, since afterwards the expression is evaluated inside of it. The rest of the code converts this environment into a list and ensures that new variables are not overriden by the former ones.

31.4 Calling from another function

  1. Q: The following R functions all use NSE. For each, describe how it uses NSE, and read the documentation to determine its escape hatch.
    • rm()
    • library() and require()
    • substitute()
    • data()
    • data.frame()

    A: For NSE in rm(), we just look at its first two arguments: ... and list = character(). If we supply expressions to ... (which can also be character vectors) , these will be caught by match.call() and become an unevaluated call (in this case a pairlist). However, rm() copies and converts the expressions into a character representation and concatenates these with the character vector supplied to the list argument. Then the removing starts… The escape hatch is to supply the objects to be removed as a character vector to rm()’s list argument.

    You can supply the input to library()’s and require()’s first argument (package) with or without quotes. In the default case (character.only = FALSE) the input to package will be converted via as.character(substitute(package)). To ommit this, just supply a character vector and set character.only = TRUE.

    substitute() and eval()/quote are the basic functions for NSE. To see how it’s done one has to understand parse trees and/or look into the underlying C code. The problematic behaviour of substitute() is pretty obvious. There might be some insights that make it predictable, but since substitute() is written for NSE and only contains the arguments expr and env, it seems that no escape hatch exists.

    Like rm() data() has the first arguments ... and list = character(). Again you can supply unquoted or quoted names to .... These will be caught, converted to character via as.character(substitute(list(...))[-1L]) and concatenated with the character input of the list argument. The escape hatch is similar to rm(): use explicitly the list argument.

    data.frame()’s first argument, ..., gets caught once via object <- as.list(substitute(list(...)))[-1L] and once x <- list(...). First one is used among others to create rownames. This can be suppressed via the setting of the argument row.names, which lets you supply a vector or specifing a column of the data.frame for the explicit naming of rows. x will be deparsed later and is then used to create the columnnames. Since this process underlies several complex rules in cases of “special namings”, data.frame() provides the check.names argument. One can set check.names = FALSE, to ensure that columns will be named however they are supplied to data.frame().

  2. Q: Base functions match.fun(), page(), and ls() all try to automatically determine whether you want standard or non-standard evaluation. Each uses a different approach. Figure out the essence of each approach then compare and contrast.

    A:

    • match.fun uses NSE if you pass something other than a length-one character or symbol, and does not use NSE otherwise.
    • page uses NSE if you pass something other than a length-one character. Symbols would still trigger NSE.
    • ls triggers NSE substitute if it cannot evaluate the directory passed as a variable, and triggers NSE deparse if the result is not a character.

    The ls method seems safest of the three approaches, but is also the least performant.

  3. Q: Add an escape hatch to plyr::mutate() by splitting it into two functions. One function should capture the unevaluated inputs. The other should take a data frame and list of expressions and perform the computation.

    A: We look again at the source code of plyr::mutate():

    What we want is to have the local variable “cols” as an argument of our new (wrapped) escape hatch function (analogously as shown with subset2_q() in the textbook).

    Therefore we create:

    We also want a function, that works with “cols” and performs the computation (the for loop in the original plyr::mutate()):

    Now we can wrap these with our new mutate function and have a nice interface:

  4. Q: What’s the escape hatch for ggplot2::aes()? What about plyr::.()? What do they have in common? What are the advantages and disadvantages of their differences?

    • One can call rename_aes directly.
    • plyr::. lets you specify an env in which to evaluate ....

    Both evaluate ... using match.call() and create a structure out of them.

    plyr::. probably requires less knowledge about internals, but is also less customizable.

  5. Q: The version of subset2_q() I presented is a simplification of real code. Why is the following version better?

    Rewrite subset2() and subscramble() to use this improved version.

    A:

    The modified version of subset2_q allows you to specify an environment in which to evaluate the condition, which allows you to run subset2_q() in more situations (such as within a dataframe).

31.5 Substitute

  1. Q: Use pryr::subs() to convert the LHS to the RHS for each of the following pairs:

    • a + b + c -> a * b * c
    • f(g(a, b), c) -> (a + b) * c
    • f(a < b, c, d) -> if (a < b) c else d

    A:

  2. Q: For each of the following pairs of expressions, describe why you can’t use subs() to convert one to the other.
    • a + b + c -> a + b * c
    • f(a, b) -> f(a, b, c)
    • f(a, b, c) -> f(a, b)
    A:
    • a + b + c -> a + b * c You can’t convert one “+” to “+” and the other to “*“, because subs() converts either all instances of the”+" or no instances of the “+”.
    • f(a, b) -> f(a, b, c) subs() cannot be used to add new arguments, only convert.
    • f(a, b, c) -> f(a, b) subs() cannot be used to subtract new arguments, only convert.
  3. Q: How does pryr::named_dots() work? Read the source.

    A: It captures the dot arguments using pryr::dots (which is just eval(substitute(alist(...)))), and then gets the names of the arguments, using “” for the arguments without names.

    If all the args are “”, it simply returns the args. Otherwise, it names the args with their values, and returns the renamed list of args.

31.6 The downsides of non-standard evaluation

  1. Q: What does the following function do? What’s the escape hatch? Do you think that this is an appropriate use of NSE?

    A: nl() extracts the dots, names them, and then evaluates them in the global namespace. This returns a list of arguments that are named by what is literally in the dots, with the values of what the dots evaluate to.

    For example:

    You can always call the underlying lapply directly as an escape hatch.

    However, it is a toy example and we are not really sure what you would gain from actually using this.

  2. Q: Instead of relying on promises, you can use formulas created with ~ to explicitly capture an expression and its environment. What are the advantages and disadvantages of making quoting explicit? How does it impact referential transparency?

    A: Using formulas in this manner would allow for referential transparency, but it would make working with NSE much more verbose. In any situation in which it is worth using NSE, it would also be worth not using formulas like this.

  3. Q: Read the standard non-standard evaluation rules found at http://developer.r-project.org/nonstandard-eval.pdf.