1 Names and values

Prerequisites

We use the development version of lobstr to answer questions regarding the internal representation of R objects.

1.1 Binding basics

  1. Q: Explain the relationship between a, b, c and d in the following code:

    A: a, b, c point to the same object (with the same address in memory). This object has the value 1:10. d points to a different object with the same value.
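
    A minimal sketch consistent with this answer, assuming the exercise binds 1:10 to a, copies the binding through b and c, and creates d from a second 1:10 call. It relies on lobstr::obj_addr() to compare addresses:

```r
library(lobstr)

a <- 1:10
b <- a
c <- b
d <- 1:10

# a, b and c are three names bound to one and the same object
obj_addr(a) == obj_addr(b)
obj_addr(b) == obj_addr(c)

# d holds an equal value, but it was created by a separate call
identical(a, d)
obj_addr(a) == obj_addr(d)
```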

  2. Q: The following code accesses the mean function in multiple ways. Do they all point to the same underlying function object? Verify with lobstr::obj_addr().

    A: Yes, they point to the same object. We confirm this by inspecting the address of the underlying function object.
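
    One way to run this check is sketched below. The particular access paths (base::mean, get(), evalq(), match.fun()) are our assumption of what the exercise lists; any other way of reaching the function can be added to the vector:

```r
library(lobstr)

# Collect the address of the function object reached via each access path
addrs <- c(
  obj_addr(mean),
  obj_addr(base::mean),
  obj_addr(get("mean")),
  obj_addr(evalq(mean)),
  obj_addr(match.fun("mean"))
)

# A single unique address means a single underlying object
length(unique(addrs)) == 1
```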

  3. Q: By default, base R data import functions, like read.csv(), will automatically convert non-syntactic names to syntactic names. Why might this be problematic? What option allows you to suppress this behaviour?

    A: When names are converted automatically and implicitly, the output of a script becomes harder to predict. For example, when R is used non-interactively and some data is read, transformed and written, then the output may not contain the same names as the original data source. This behaviour may introduce problems in downstream analysis. To avoid automatic name conversion, set check.names = FALSE.
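
    A small illustration, using a made-up two-column CSV string (the column names are hypothetical) that is passed to read.csv() via its text argument:

```r
# Hypothetical input with two non-syntactic column names
csv <- "my var,2nd col\n1,2\n"

# Default behaviour: names are silently made syntactic
names(read.csv(text = csv))

# check.names = FALSE keeps the names exactly as they appear in the source
names(read.csv(text = csv, check.names = FALSE))
```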

  4. Q: What rules does make.names() use to convert non-syntactic names into syntactic names?

    A: A valid name starts with a letter or a dot (which must not be followed by a number). It also consists of letters, numbers, dots and underscores only ("_" are allowed since R version 1.9.0).

    Three main mechanisms ensure syntactically valid names (see ?make.names):
    • an X is prepended to names that do not start with a letter, or that start with a dot followed by a number
    • (additionally) non-valid characters are replaced by a dot
    • a dot is appended to reserved R keywords (see ?reserved)

    Interestingly, some of these transformations are influenced by the current locale (from ?make.names):

    The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
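
    The three mechanisms can be tried out directly. The input strings below are chosen by us for illustration:

```r
# Prepending "X"
make.names("")           # empty name
make.names(".1")         # dot followed by a number

# Replacing non-valid characters by a dot
make.names("non-valid")
make.names("@")          # "X" is prepended as well
make.names("  R")        # two leading spaces

# Appending a dot to reserved keywords
make.names("if")
```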

  5. Q: I slightly simplified the rules that govern syntactic names. Why is .123e1 not a syntactic name? Read ?make.names for the full details.

    A: .123e1 is not a syntactic name because it starts with a dot that is followed by a number. R therefore parses it as a numeric constant (0.123e1, i.e. 1.23) rather than as a name.
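
    This can be checked at the console. The sketch below only uses base R:

```r
# Typed as code, .123e1 is a number, not a name
is.numeric(.123e1)
all.equal(.123e1, 1.23)

# make.names() repairs the string by prepending an "X"
make.names(".123e1")
```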

1.2 Copy-on-modify

  1. Q: Why is tracemem(1:10) not useful?

    A: When 1:10 is called an object with an address in memory is created, but it is not bound to a name. Therefore the object cannot be called or manipulated from R. As no copies will be made, it is not useful to track the object for copying.

  2. Q: Explain why tracemem() shows two copies when you run this code. Hint: carefully look at the difference between this code and the code shown earlier in the section.

    A: Initially the vector x has integer type. The replacement call assigns a double to the third element of x, which triggers copy-on-modify. Because of R’s coercion rules, a type conversion occurs, which affects the vector as a whole and leads to an additional copy.

    By assigning an integer instead of a double one copy (the one related to coercion) may be avoided:
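
    A sketch of both variants, assuming x starts out as the integer vector c(1L, 2L, 3L). How many copies tracemem() reports can depend on the R version and the front end, so no output is shown; the type change, however, is easy to verify. The capabilities("profmem") guard only skips tracing on builds without memory profiling:

```r
can_trace <- capabilities("profmem")

# Variant 1: assigning a double coerces the whole vector
x <- c(1L, 2L, 3L)
if (can_trace) tracemem(x)
x[[3]] <- 4
typeof(x)
if (can_trace) untracemem(x)

# Variant 2: assigning an integer keeps the type, so no coercion is needed
y <- c(1L, 2L, 3L)
if (can_trace) tracemem(y)
y[[3]] <- 4L
typeof(y)
if (can_trace) untracemem(y)
```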

  3. Q: Sketch out the relationship between the following objects:

    A: a contains a reference to an address with the value 1:10. b contains a list of two references to the same address as a. c contains a list of b (containing two references to a), a (containing the same reference again) and a reference pointing to a different address containing the same value (1:10).
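
    The relationships can be inspected with lobstr. The definitions of a, b and c below are our reading of the answer, not necessarily the exact exercise code:

```r
library(lobstr)

a <- 1:10
b <- list(a, a)
c <- list(b, a, 1:10)

# Print the reference tree; repeated addresses mark shared objects
ref(c)

# Both elements of b refer to the object bound to a
all(obj_addrs(b) == obj_addr(a))

# c refers to b, to a, and to a third vector with an equal value
obj_addrs(c)[[1]] == obj_addr(b)
obj_addrs(c)[[2]] == obj_addr(a)
identical(c[[3]], a)
```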

  4. Q: What happens when you run this code:

    Draw a picture.

    A: The initial reference tree of x shows that the name x binds to a list object. This object contains a reference to the integer vector 1:10.

    When x is assigned to an element of itself, copy-on-modify takes place and the list is copied to a new address in memory.

    The list object previously bound to x is now referenced in the newly created list object. It is no longer bound to a name. The integer vector is referenced twice.
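
    In place of a drawing, the reference tree can be printed with lobstr::ref(). The sketch assumes the exercise starts from x <- list(1:10) and then assigns x to its own second element:

```r
library(lobstr)

x <- list(1:10)
old_addr <- obj_addr(x)   # address of the list before the assignment

x[[2]] <- x

# Reference tree after the assignment
ref(x)

# Compare the second element with the list x was bound to before
obj_addrs(x)[[2]] == old_addr

# The integer vector is reachable via x[[1]] and via x[[2]][[1]]
obj_addrs(x)[[1]] == obj_addrs(x[[2]])[[1]]
```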

1.3 Object size

  1. Q: In the following example, why are object.size(y) and obj_size(y) so radically different? Consult the documentation of object.size().

    A: object.size() doesn’t account for shared elements within lists. Therefore, the results differ by a factor of ~ 100.
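
    The gap can be reproduced as follows, assuming y is a list that contains the same numeric vector 100 times (our guess at the exercise code):

```r
library(lobstr)

# One vector of 10,000 doubles, referenced 100 times
y <- rep(list(runif(1e4)), 100)

object.size(y)   # counts the vector once per list element
obj_size(y)      # counts the shared vector only once

# Ratio between the two measurements
as.numeric(object.size(y)) / as.numeric(obj_size(y))
```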

  2. Q: Take the following list. Why is its size somewhat misleading?

    A: It is somewhat misleading, because all three functions are built-in to R as part of the base and stats packages and hence always loaded.

    From the following calculations we can see that this applies to about 2696 objects which are usually loaded by default and take up about 50.93 MB of memory.
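
    The counting part of such a calculation might look like the sketch below. The list of packages is our assumption of what is attached by default, and the exact count depends on the R version, so no result is shown:

```r
# Packages that are usually attached in a fresh R session (assumption)
base_pkgs <- c(
  "package:stats", "package:graphics", "package:grDevices",
  "package:utils", "package:datasets", "package:methods", "package:base"
)

# Only keep those that are actually on the search path
base_pkgs <- intersect(base_pkgs, search())

# Number of objects per package, including hidden ones
n_objs <- vapply(
  base_pkgs,
  function(pkg) length(ls(as.environment(pkg), all.names = TRUE)),
  integer(1)
)

n_objs
sum(n_objs)
```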

  3. Q: Predict the output of the following code:

    A: TODO: lobstr and pryr return very different results (600 bytes vs 4MB in the first example). So, before we rewrite this answer it needs to be clarified, why these differences occur and how to handle them best. See also the related issue: https://github.com/hadley/adv-r/issues/1324.

    Since lobstr::obj_size() currently returns very different values, we will use unclass(pryr::obj_size()) for now.

    To predict the size of x, we first find out via obj_size(integer(0)) that an empty integer vector takes 40 B. For every element of the integer vector an additional 4 B are needed, and R allocates memory in chunks of 2, so 8 B at a time. This can be verified for example via sapply(1:100, function(x) obj_size(integer(x))). Overall our prediction will result in 40 B + 1000000 * 4 B = 4000040 B:

    To predict the size of y <- list(x, x) consider that both list elements point to the same memory address. They share the same reference, which means that no additional memory is needed. A list takes 40 B in memory and 8 B for each element. Overall our prediction will result in x (4000040 B) + list of length 2 (40 B + 16 B):

    Since x and y are names with bindings to objects that point to the same reference, no additional memory is needed and our prediction is the maximum memory of both objects (y; 4000096 B):

    The next one gets a bit more tricky. Since the first element of y becomes different from x, a completely new object is created in memory. Since 10 is of type double (which triggers a silent coercion), the new object will take more memory. A double vector needs 40 B + length * 8 B (overall 8000040 B). So we get: first element of y (8000040 B) + second element of y (x; 4000040 B) + list of length 2 (40 B + 16 B) = 12000136 B as our prediction:

    Again all elements of x are shared within y (x is the second element of y). So the overall memory usage corresponds to y’s:

    In the next example the second element of y also gets the same value as the first one. However, R does not know that it is the same as the first element, so a new object is created, taking the same amount of memory:

    Now x and y don’t share any values anymore (from R’s perspective) and their memory adds up:
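
    The arithmetic behind these predictions can be written out in plain R. The sketch takes the per-object sizes used in the sums above as given (40 B per vector or list header, 4 B per integer, 8 B per double or list element) and does not measure anything. The variable names are ours:

```r
header    <- 40   # bytes for an empty vector or list (as assumed above)
int_bytes <- 4
dbl_bytes <- 8
ptr_bytes <- 8    # bytes per list element
n         <- 1e6

size_int_vec <- header + n * int_bytes   # x
size_dbl_vec <- header + n * dbl_bytes   # a coerced copy of x
size_list2   <- header + 2 * ptr_bytes   # list of length 2, without contents

# y <- list(x, x): both elements share one vector
y_shared <- size_int_vec + size_list2

# After the first element is replaced by a double vector
y_one_double <- size_dbl_vec + size_int_vec + size_list2

# After the second element is replaced as well
y_two_doubles <- 2 * size_dbl_vec + size_list2

# x and y no longer share anything, so their sizes add up
x_plus_y <- size_int_vec + y_two_doubles

c(size_int_vec, y_shared, y_one_double, y_two_doubles, x_plus_y)
```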

1.4 Modify-in-place

  1. Q: Wrap the two methods for subtracting medians into two functions, then use the bench package to carefully compare their speeds. How does performance change as the number of columns increase?

    A: First, let’s define a function to create some random data and a function to subtract the median from each column.

    We can then profile the performance by benchmarking subtract_medians() on data frame and list input for a specified number of columns. The functions should both input and output a data frame, so one is going to do a bit more work.

    The bench package allows us to run our benchmark across a grid of parameters easily. We will use it to slowly increase the number of columns containing random data.

    The execution times for median subtraction on data frame columns grow faster than linearly with the number of columns in the input data. This is because the data frame is copied more often and each copy is also bigger. For subtraction on list elements the execution time increases only linearly.

    For list input with fewer than ~ 800 columns, the cost of the additional data structure conversion is relatively big. For very wide data frames, the overhead of the additional copies slows down the computation considerably. Which function is faster therefore also depends on the size of the data.
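
    A minimal base R sketch of the two approaches follows. The helper name create_random_df() and the dimensions are our assumptions, and system.time() stands in for a careful benchmark:

```r
# Hypothetical helper: a data frame filled with random numbers
create_random_df <- function(nrow, ncol) {
  as.data.frame(matrix(runif(nrow * ncol), nrow = nrow))
}

# Subtract one median per column; works for data frames and plain lists
subtract_medians <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

df <- create_random_df(nrow = 1000, ncol = 20)
medians <- vapply(df, median, numeric(1))

# Data frame in, data frame out
res_df <- subtract_medians(df, medians)

# Same task via a list, converting back to a data frame at the end
res_list <- as.data.frame(subtract_medians(as.list(df), medians))

# Both routes give the same result
all.equal(res_df, res_list)

# Rough timings over 50 repetitions each
system.time(for (k in 1:50) subtract_medians(df, medians))
system.time(for (k in 1:50) as.data.frame(subtract_medians(as.list(df), medians)))
```

    For careful measurements, the two expressions could be passed to bench::mark() and run over several column counts with bench::press().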

  2. Q: What happens if you attempt to use tracemem() on an environment?

    A: tracemem() cannot be used to mark and trace environments.

    The error occurs because “it is not useful to trace NULL, environments, promises, weak references, or external pointer objects, as these are not duplicated” (see ?tracemem). Environments are always modified in place.
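
    Both points can be checked in base R. The sketch wraps the call in try() so that the script keeps running, and the names e and e2 are ours:

```r
e <- new.env()

# tracemem() refuses to trace an environment and signals an error
res <- try(tracemem(e), silent = TRUE)
inherits(res, "try-error")

# Environments are modified in place: a change made via e2 is visible via e
e2 <- e
e2$a <- 1
e$a
```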