25 Rewriting R code in C++

25.1 Getting started with C++

  1. Q: With the basics of C++ in hand, it’s now a great time to practice by reading and writing some simple C++ functions. For each of the following functions, read the code and figure out what the corresponding base R function is. You might not understand every part of the code yet, but you should be able to figure out the basics of what the function does.

    A: The code above corresponds to the following R functions:

    • f1: mean()
    • f2: cumsum()
    • f3: any()
    • f4: Position()
    • f5: pmin()
  2. Q: To practice your function writing skills, convert the following functions into C++. For now, assume the inputs have no missing values.

    1. all().

    2. cumprod(), cummin(), cummax().

    3. diff(). Start by assuming lag 1, and then generalise for lag n.

    4. range().

    5. var(). Read about the approaches you can take on Wikipedia. Whenever implementing a numerical algorithm, it’s always good to check what is already known about the problem.

    A: We take this challenge, which leads us to:

    1. all()
    2. cumprod(), cummin(), cummax()
    3. diff() (start by assuming lag 1, and then generalise for lag n)
    4. range()
    5. var()

25.2 Missing values

  1. Q: Rewrite any of the functions from the first exercise to deal with missing values. If na.rm is true, ignore the missing values. If na.rm is false, return a missing value if the input contains any missing values. Some good functions to practice with are min(), max(), range(), mean(), and var().

    A: We will refactor cumsum(), any(), Position() and pmin() so they can deal with missing values, but we first practise by rewriting min(), max(), range(), mean() and var(). We try to keep the overall behaviour close to that of the original functions whenever na_rm = false. We mostly stick with vector data types as return values, to avoid irregular type conversions.

    We introduce an na_rm argument to make minC() aware of NAs. If x contains only NA values, minC() should return Inf when na_rm = true.

    To implement maxC() we reuse minC() and take advantage of a connection between the minimum and the maximum: \(\max(x) = -\min(-x)\).

    minC() and maxC() enable us to write a compact and NA-aware rangeC() function.
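
    The three functions described above can be sketched as follows. This is plain standard C++ rather than Rcpp: std::vector<double> stands in for NumericVector, and a quiet NaN (the hypothetical constant kNA) stands in for NA_REAL, so the sketch cannot tell NA and NaN apart.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

double minC(const std::vector<double>& x, bool na_rm = false) {
  // The minimum of "nothing" is +Inf, which is also what remains
  // when every element is skipped as missing.
  double out = std::numeric_limits<double>::infinity();
  for (double xi : x) {
    if (std::isnan(xi)) {
      if (!na_rm) return kNA;  // a single missing value decides the result
      continue;                // na_rm = true: ignore it
    }
    if (xi < out) out = xi;
  }
  return out;
}

// max(x) = -min(-x), so maxC() only negates and delegates.
double maxC(const std::vector<double>& x, bool na_rm = false) {
  std::vector<double> neg(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) neg[i] = -x[i];
  return -minC(neg, na_rm);
}

// range() is then just the pair (min, max).
std::vector<double> rangeC(const std::vector<double>& x, bool na_rm = false) {
  std::vector<double> out(2);
  out[0] = minC(x, na_rm);
  out[1] = maxC(x, na_rm);
  return out;
}
```

    Because negating a missing value leaves it missing, maxC() inherits the NA handling of minC() without any extra checks.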

    Our NA-aware meanC() function should return NaN if na_rm = true and all(is.na(x)).
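
    A minimal sketch of such a meanC(), again in plain standard C++ with a quiet NaN (the hypothetical kNA) standing in for NA_REAL. Under that assumption the "NA" returned for na_rm = false and the NaN returned for an all-missing input are the same bit pattern; real Rcpp code would keep them distinct.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

double meanC(const std::vector<double>& x, bool na_rm = false) {
  double total = 0.0;
  std::size_t n = 0;  // number of non-missing values
  for (double xi : x) {
    if (std::isnan(xi)) {
      if (!na_rm) return kNA;
      continue;
    }
    total += xi;
    ++n;
  }
  // If every value was skipped this is 0.0 / 0.0, which is NaN.
  return total / static_cast<double>(n);
}
```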

    For varC(), we handle both cases of na_rm inside the first for loop, as this reduces code duplication.
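
    One way to arrange this is a two-pass algorithm in which the first loop both accumulates the mean and decides what to do with missing values. The sketch below assumes plain standard C++, a quiet NaN (the hypothetical kNA) in place of NA_REAL, and that fewer than two usable values should give NA, as var(1) does in R.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

double varC(const std::vector<double>& x, bool na_rm = false) {
  double total = 0.0;
  std::size_t n = 0;

  // First loop: handles both na_rm settings and sums the usable values.
  for (double xi : x) {
    if (std::isnan(xi)) {
      if (!na_rm) return kNA;
      continue;
    }
    total += xi;
    ++n;
  }
  if (n < 2) return kNA;

  const double mean = total / static_cast<double>(n);
  double ss = 0.0;

  // Second loop: missing values can only still be present when
  // na_rm is true, so they are simply skipped.
  for (double xi : x) {
    if (std::isnan(xi)) continue;
    ss += (xi - mean) * (xi - mean);
  }
  return ss / static_cast<double>(n - 1);
}
```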

    Now, let’s extend the functions cumsum(), any(), Position() and pmin() from the first exercise.

    In cumsumC(), a single NA is returned for na_rm = false as soon as the input contains an NA. For na_rm = true, we keep the NAs in the output but ignore them in the cumulative sums.
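
    A sketch of this behaviour in plain standard C++, assuming a quiet NaN (the hypothetical kNA) stands in for NA_REAL and that na_rm = false collapses the result to a single NA whenever the input contains one:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

std::vector<double> cumsumC(const std::vector<double>& x, bool na_rm = false) {
  std::vector<double> out(x.size());
  double total = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (std::isnan(x[i])) {
      if (!na_rm) return std::vector<double>(1, kNA);  // single NA
      out[i] = kNA;  // keep the gap visible, but do not add it
      continue;
    }
    total += x[i];
    out[i] = total;
  }
  return out;
}
```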

    In our new implementation of anyC() we use LogicalVector as the return type. If we used bool instead, the C++ NA_LOGICAL would be converted into R’s logical TRUE.
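
    The conversion problem can be reproduced without R. The sketch below assumes logical values are encoded as ints the way R stores them (0 for FALSE, 1 for TRUE, INT_MIN for NA) and follows base R's any() in letting a TRUE win over an NA; the names kNALogical and anyC are illustrative.

```cpp
#include <climits>
#include <vector>

// Assumption: R's encoding of logicals, 0 = FALSE, 1 = TRUE, INT_MIN = NA.
const int kNALogical = INT_MIN;

// Returns 1, 0 or kNALogical, so the result needs a three-valued type.
int anyC(const std::vector<int>& x, bool na_rm = false) {
  bool seen_na = false;
  for (int xi : x) {
    if (xi == kNALogical) {
      seen_na = true;
      continue;
    }
    if (xi != 0) return 1;  // a TRUE decides the result regardless of NAs
  }
  return (seen_na && !na_rm) ? kNALogical : 0;
}

// What a bool return type would do to the result: every non-zero int,
// including the NA sentinel, becomes true.
bool anyAsBool(const std::vector<int>& x, bool na_rm = false) {
  return static_cast<bool>(anyC(x, na_rm));
}
```

    With the input {0, kNALogical}, anyC() reports NA while anyAsBool() reports true, which is exactly the silent conversion a vector return type avoids.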

    For PositionC() we check the results of the predicate function for NAs. In some cases it may also make sense to check the elements of the list input for NAs, and to provide NA handling for the predicate function.

    When we set na_rm = true in our pminC() function, it only returns NAs at indices where both x and y contain an NA.
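
    A sketch of this element-wise rule in plain standard C++. It assumes a quiet NaN (the hypothetical kNA) in place of NA_REAL, inputs of equal length (no recycling), and that na_rm = false returns NA wherever either input is missing, as base R's pmin() does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

std::vector<double> pminC(const std::vector<double>& x,
                          const std::vector<double>& y,
                          bool na_rm = false) {
  // Assumption: x and y have the same length; extra elements are ignored.
  const std::size_t n = std::min(x.size(), y.size());
  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const bool x_na = std::isnan(x[i]);
    const bool y_na = std::isnan(y[i]);
    if (x_na && y_na) {
      out[i] = kNA;                     // nothing left to compare
    } else if (x_na || y_na) {
      // Exactly one side is missing: NA unless we may drop it.
      out[i] = na_rm ? (x_na ? y[i] : x[i]) : kNA;
    } else {
      out[i] = std::min(x[i], y[i]);
    }
  }
  return out;
}
```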

  2. Q: Rewrite cumsum() and diff() so they can handle missing values. Note that these functions have slightly more complicated behaviour.

    A: Our earlier NA-aware cumsumC() returns a single NA when na_rm = false, so we modify it slightly here to always return a vector of the same length as its x argument. The new cumsumC2() function always treats NA values like zeros when accumulating. With na_rm = false, the NA values are kept in the output. With na_rm = true, each NA is replaced by the last non-NA value of the running sum, or by zero if no non-NA value precedes it (for example, when x consists entirely of NA values).
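
    Under the same assumptions as before (plain standard C++, a quiet NaN named kNA standing in for NA_REAL), the length-preserving variant might look like this:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

std::vector<double> cumsumC2(const std::vector<double>& x, bool na_rm = false) {
  std::vector<double> out(x.size());
  double total = 0.0;  // missing values contribute zero to this sum
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (std::isnan(x[i])) {
      // na_rm = true: carry the running sum forward (zero if nothing
      // has been added yet); na_rm = false: leave the gap as NA.
      out[i] = na_rm ? total : kNA;
    } else {
      total += x[i];
      out[i] = total;
    }
  }
  return out;
}
```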

    In diffC(), with na_rm = false we again return a single NA as soon as an NA occurs in the input. With na_rm = true, we ensure that every difference affected by an NA is itself NA, while all other differences are computed as usual. (We could also have chosen to exclude the NAs completely from the input or, alternatively, from the output.)
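
    A sketch of this diffC(), assuming plain standard C++ and a quiet NaN (the hypothetical kNA) in place of NA_REAL. With that stand-in, the na_rm = true case needs no explicit check, because a subtraction involving NaN already yields NaN.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Assumption: a quiet NaN plays the role of R's NA_real_.
const double kNA = std::numeric_limits<double>::quiet_NaN();

std::vector<double> diffC(const std::vector<double>& x,
                          std::size_t lag = 1,
                          bool na_rm = false) {
  if (lag == 0) throw std::invalid_argument("lag must be >= 1");

  if (!na_rm) {
    for (double xi : x) {
      if (std::isnan(xi)) return std::vector<double>(1, kNA);  // single NA
    }
  }
  if (lag >= x.size()) return std::vector<double>();

  std::vector<double> out(x.size() - lag);
  for (std::size_t i = 0; i < out.size(); ++i) {
    // A missing operand propagates, so only the affected
    // differences end up missing.
    out[i] = x[i + lag] - x[i];
  }
  return out;
}
```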

25.3 Standard Template Library

To practice using the STL algorithms and data structures, implement the following R functions in C++, using the hints provided:

  1. Q: median.default() using partial_sort.

    A:

  2. Q: %in% using unordered_set and the find() or count() methods.

    A:
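
    A minimal sketch for integer input, in plain standard C++ with std::vector in place of Rcpp's vector classes. The table is loaded into an unordered_set once, and each element of x is then looked up with count(); the name inC is illustrative.

```cpp
#include <unordered_set>
#include <vector>

std::vector<bool> inC(const std::vector<int>& x, const std::vector<int>& table) {
  // Build the lookup set once; each membership test is then
  // constant time on average.
  const std::unordered_set<int> seen(table.begin(), table.end());

  std::vector<bool> out;
  out.reserve(x.size());
  for (int xi : x) {
    out.push_back(seen.count(xi) > 0);
  }
  return out;
}
```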

  3. Q: unique() using an unordered_set (challenge: do it in one line!).

    A: (TODO: address the challenge.)
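
    One way to approach the one-line challenge, sketched in plain standard C++, is to construct the set directly from the iterator range. The catch is that an unordered_set has no defined order, whereas R's unique() keeps values in order of first appearance, so a second, order-preserving variant is shown as well. Both function names are illustrative.

```cpp
#include <unordered_set>
#include <vector>

// One-line body: the range constructor drops duplicates,
// but the iteration order of the result is unspecified.
std::unordered_set<int> uniqueC(const std::vector<int>& x) {
  return std::unordered_set<int>(x.begin(), x.end());
}

// Order-preserving variant: insert() reports through .second
// whether the value was new.
std::vector<int> uniqueOrderedC(const std::vector<int>& x) {
  std::unordered_set<int> seen;
  std::vector<int> out;
  for (int xi : x) {
    if (seen.insert(xi).second) out.push_back(xi);
  }
  return out;
}
```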

  4. Q: min() using std::min(), or max() using std::max().

    A:
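
    A minimal sketch of min() in plain standard C++, assuming no missing values. Starting from +Inf means an empty input returns Inf, as R's min() does (R additionally emits a warning, which is omitted here).

```cpp
#include <algorithm>
#include <limits>
#include <vector>

double minC(const std::vector<double>& x) {
  double out = std::numeric_limits<double>::infinity();
  for (double xi : x) {
    out = std::min(out, xi);  // keep the smaller of the two
  }
  return out;
}
```

    max() follows the same pattern, starting from -Inf and using std::max().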

  5. Q: which.min() using min_element, or which.max() using max_element.

    A:
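
    A sketch of which.min() in plain standard C++, assuming no missing values. min_element returns an iterator to the first smallest element, which matches which.min()'s choice among ties; the offset is shifted by one to give R's 1-based index. Returning 0 for empty input is an assumption of this sketch (R returns integer(0)).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

std::size_t whichMinC(const std::vector<double>& x) {
  if (x.empty()) return 0;  // assumption: 0 signals "no position"
  auto it = std::min_element(x.begin(), x.end());
  // distance() gives a 0-based offset; add 1 for R-style indexing.
  return static_cast<std::size_t>(std::distance(x.begin(), it)) + 1;
}
```

    which.max() is the same function with std::max_element.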

  6. Q: setdiff(), union(), and intersect() for integers using sorted ranges and set_union, set_intersection and set_difference.

    A:
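
    The STL set algorithms require sorted input, so a sketch in plain standard C++ can first sort and deduplicate both vectors and then write the result through a back_inserter. The helper sortedUnique and the three function names are illustrative. Note that the results come back sorted, whereas the R functions keep values in order of first appearance.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Sort a copy and drop duplicates, giving a valid "sorted range".
std::vector<int> sortedUnique(std::vector<int> x) {
  std::sort(x.begin(), x.end());
  x.erase(std::unique(x.begin(), x.end()), x.end());
  return x;
}

std::vector<int> unionC(const std::vector<int>& x, const std::vector<int>& y) {
  const std::vector<int> a = sortedUnique(x), b = sortedUnique(y);
  std::vector<int> out;
  std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                 std::back_inserter(out));
  return out;
}

std::vector<int> intersectC(const std::vector<int>& x, const std::vector<int>& y) {
  const std::vector<int> a = sortedUnique(x), b = sortedUnique(y);
  std::vector<int> out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
  return out;
}

// Elements of x that do not occur in y.
std::vector<int> setdiffC(const std::vector<int>& x, const std::vector<int>& y) {
  const std::vector<int> a = sortedUnique(x), b = sortedUnique(y);
  std::vector<int> out;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                      std::back_inserter(out));
  return out;
}
```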