2 Names and values
We use the development version of lobstr to answer questions regarding the internal representation of R objects.
2.1 Binding basics
Q: Explain the relationship between a, b, c and d in the following code:
A: a, b and c point to the same object (with the same address in memory). This object has the value 1:10. d points to a different object with the same value.
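A minimal sketch of this situation, assuming the lobstr package is installed. The four assignments are an illustrative reconstruction rather than a quotation of the exercise code; obj_addr() reports the memory address of the object a name is bound to.

```r
library(lobstr)

a <- 1:10   # create an object and bind the name a to it
b <- a      # bind a second name to the same object
c <- b      # bind a third name to the same object
d <- 1:10   # create a new object that merely has the same value

# Names bound to the same object report the same address
same_abc <- obj_addr(a) == obj_addr(b) && obj_addr(b) == obj_addr(c)

# d is equal in value but lives at a different address
same_ad <- obj_addr(a) == obj_addr(d)

print(same_abc)
print(same_ad)
print(identical(a, d))
```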
Q: The following code accesses the mean function in multiple ways. Do they all point to the same underlying function object? Verify with lobstr::obj_addr().
A: Yes, they point to the same object. We confirm this by inspecting the address of the underlying function object.
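The check could look like the sketch below. It assumes that lobstr is installed and that mean has not been masked by another definition in the session; the five access paths are illustrative.

```r
library(lobstr)

# Several ways of reaching the function that the name mean refers to
addrs <- c(
  obj_addr(mean),
  obj_addr(base::mean),
  obj_addr(get("mean")),
  obj_addr(evalq(mean)),
  obj_addr(match.fun("mean"))
)

# A single unique address means a single underlying function object
print(unique(addrs))
```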
Q: By default, base R data import functions, like read.csv(), will automatically convert non-syntactic names to syntactic names. Why might this be problematic? What option allows you to suppress this behaviour?
A: When names are converted automatically and implicitly, the output of a script becomes harder to predict. For example, when R is used non-interactively and some data is read, transformed and written, then the output may not contain the same names as the original data source. This behaviour may introduce problems in downstream analysis. To avoid automatic name conversion set check.names = FALSE.
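As a small illustration, read.csv() has a check.names argument that controls this conversion. The CSV content below is made up for the example.

```r
# A tiny CSV whose header contains non-syntactic names (hypothetical data)
csv_text <- "my var,2nd col\n1,2\n"

# Default behaviour: the header is passed through make.names()
converted <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written
preserved <- names(read.csv(text = csv_text, check.names = FALSE))

print(converted)  # "my.var"   "X2nd.col"
print(preserved)  # "my var"   "2nd col"
```

Columns with preserved names then have to be accessed with backticks or [[ ]], which is the price of keeping the original header.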
Q: What rules does make.names() use to convert non-syntactic names into syntactic names?
A: A valid name starts with a letter or a dot (which must not be followed by a number). It also consists of letters, numbers, dots and underscores only ("_" are allowed since R version 1.9.0).
Three main mechanisms ensure syntactically valid names (see ?make.names):
1. The variable name will be prefixed with an X when it does not start with a letter, or when it starts with a dot followed by a number.
2. Additionally, non-valid characters are replaced by a dot.
3. Reserved R keywords (see ?reserved) have a dot appended.
Interestingly, some of these transformations are influenced by the current locale (from ?make.names):
The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.
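The three mechanisms can be tried out directly with make.names(). The inputs below are illustrative, one or two per mechanism.

```r
# Inputs chosen to trigger each mechanism (hypothetical examples)
raw_names <- c(
  "1a",   # starts with a digit          -> prefixed with X
  ".1a",  # dot followed by a digit      -> prefixed with X
  "a b",  # contains an invalid space    -> space becomes a dot
  "if"    # reserved word                -> dot appended
)

fixed_names <- make.names(raw_names)
print(fixed_names)  # "X1a"  "X.1a" "a.b"  "if."
```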
Q: I slightly simplified the rules that govern syntactic names. Why is .123e1 not a syntactic name? Read ?make.names for the full details.
A: .123e1 is not a syntactic name, because it starts with one dot which is followed by a number. R therefore parses it as the numeric constant 1.23 rather than as a name.
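A short check of this, using only base R. Parsing the text .123e1 yields a number, and make.names() has to prefix the string with X to turn it into a valid name.

```r
# The parser reads .123e1 as a numeric constant, not as a name
value <- eval(parse(text = ".123e1"))
print(is.numeric(value))

# make.names() repairs the string by prefixing it with X
fixed <- make.names(".123e1")
print(fixed)  # "X.123e1"
```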
Q: Why is tracemem(1:10) not useful?
A: When 1:10 is called, an object with an address in memory is created, but it is not bound to a name. Therefore the object cannot be called or manipulated from R. As no copies will be made, it is not useful to track the object for copying.
Q: Explain why tracemem() shows two copies when you run this code. Hint: carefully look at the difference between this code and the code shown earlier in the section.
A: Initially the vector x has integer type. The replacement call assigns a double to the third element of x, which triggers copy-on-modify. Because of R's coercion rules, a type conversion occurs, which affects the vector as a whole and leads to an additional copy.
By assigning an integer instead of a double, one copy (the one related to coercion) may be avoided.
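The difference between the two assignments can be sketched as follows. The vectors are illustrative, and tracemem() is only called when the R build supports memory profiling; the number of copies it reports can vary with the R version and the front end, so the sketch checks the resulting types instead.

```r
# Assigning a double into an integer vector coerces the whole vector
x <- c(1L, 2L, 3L)
if (capabilities("profmem")) tracemem(x)
x[[3]] <- 4          # 4 is a double
type_after_double <- typeof(x)

# Assigning an integer keeps the type, so no coercion copy is needed
y <- c(1L, 2L, 3L)
if (capabilities("profmem")) tracemem(y)
y[[3]] <- 4L         # 4L is an integer
type_after_integer <- typeof(y)

print(type_after_double)   # "double"
print(type_after_integer)  # "integer"
```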
Q: Sketch out the relationship between the following objects:
A: a contains a reference to an address with the value 1:10. b contains a list of two references to the same address as a. c contains a list of b (containing two references to a), a (containing the same reference again) and a reference pointing to a different address containing the same value (1:10).
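These relationships can be verified with lobstr, as in the sketch below. The three assignments are an illustrative reconstruction of the objects described above; obj_addrs() returns the addresses of the elements of a list, and ref() prints the reference tree.

```r
library(lobstr)

a <- 1:10
b <- list(a, a)
c <- list(b, a, 1:10)

# Both elements of b point at the object bound to a
b_refs_a <- all(obj_addrs(b) == obj_addr(a))

c_addrs <- obj_addrs(c)
c1_is_b   <- c_addrs[[1]] == obj_addr(b)  # first element is the list b
c2_is_a   <- c_addrs[[2]] == obj_addr(a)  # second element is the vector a
c3_is_new <- c_addrs[[3]] != obj_addr(a)  # third element is a separate object

# Print the reference tree; shared addresses appear more than once
ref(c)
```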
Q: What happens when you run this code:
Draw a picture.
A: The initial reference tree of x shows that the name x binds to a list object. This object contains a reference to the integer vector. When x is assigned to an element of itself, copy-on-modify takes place and the list is copied to a new address in memory.
The list object previously bound to x is now referenced in the newly created list object. It is no longer bound to a name. The integer vector is referenced twice.
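The picture can also be checked programmatically. The sketch below assumes lobstr is installed and uses an illustrative list holding one integer vector; it records the addresses before and after the self-assignment.

```r
library(lobstr)

x <- list(1:10)
old_list_addr <- obj_addr(x)        # address of the original list
vec_addr      <- obj_addrs(x)[[1]]  # address of the integer vector

x[[2]] <- x                         # assign the list to an element of itself

new_list_addr <- obj_addr(x)        # x is now bound to a new list
elem_addrs    <- obj_addrs(x)       # addresses of the two elements

# The reference tree shows the old list nested inside the new one
ref(x)
```

Comparing elem_addrs[[2]] with old_list_addr shows whether the nested element is the list that x was bound to before the assignment.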
2.3 Object size
Q: In the following example, why are object.size(y) and obj_size(y) so radically different? Consult the documentation of object.size().
A: object.size() doesn't account for shared elements within lists. Therefore, the results differ by a factor of ~ 100.
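A sketch of such a comparison, assuming lobstr is installed. The construction of y is an assumption made for illustration: a list whose 100 elements all reference one numeric vector.

```r
library(lobstr)

# 100 list elements that all reference the same vector (hypothetical example)
y <- rep(list(runif(1e4)), 100)

size_base   <- as.numeric(object.size(y))  # counts the vector once per element
size_lobstr <- as.numeric(obj_size(y))     # counts the shared vector only once

print(size_base)
print(size_lobstr)
print(size_base / size_lobstr)
```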
Q: Take the following list. Why is its size somewhat misleading?
A: It is somewhat misleading, because all three functions are built-in to R as part of the base and stats packages and hence always loaded.
From the following calculations we can see that this applies to about 2696 objects which are usually loaded by default and take up about 51.1 MB of memory.
base_env_names <- c(
  "package:stats", "package:graphics", "package:grDevices",
  "package:utils", "package:datasets", "package:methods",
  "Autoloads", "package:base"
)
base_env_list <- sapply(
  base_env_names,
  function(x) mget(ls(x, all = TRUE), as.environment(x))
)
sum(lengths(base_env_list))
#> [1] 2696
sapply(base_env_list, lobstr::obj_size)
#>    package:stats package:graphics package:grDevices    package:utils
#>         11446216          3206304           1831680          7428184
#> package:datasets  package:methods         Autoloads     package:base
#>           604208         13437416               288         15622504
round(sum(sapply(base_env_list, lobstr::obj_size)) / 1024^2, 2)
#> [1] 51.1
Q: Predict the output of the following code:
A: To predict the object size of x, let's first find out how much memory an empty double occupies and how the size grows with the length of the vector.
We see that R requires 48 bytes for a double of length 0. Generally, each additional element in a vector requires 8 additional bytes of memory. But for some small vectors R preallocates a little more memory than needed, which improves performance. Here is Hadley's explanation from the first edition of Advanced R:
… why does the memory size grow irregularly? To understand why, you need to know a little bit about how R requests memory from the operating system. Requesting memory (with malloc()) is a relatively expensive operation. Having to request memory every time a small vector is created would slow R down considerably. Instead, R asks for a big block of memory and then manages that block itself. This block is called the small vector pool and is used for vectors less than 128 bytes long. For efficiency and simplicity, it only allocates vectors that are 8, 16, 32, 48, 64, or 128 bytes long. … Beyond 128 bytes, it no longer makes sense for R to manage vectors. After all, allocating big chunks of memory is something that operating systems are very good at. Beyond 128 bytes, R will ask for memory in multiples of 8 bytes. This ensures good alignment.
However, to estimate obj_size(x) we just calculate 48 B + 1000000 * 8 B = 8000048 B (about 8 MB), which proves correct.
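The growth pattern and the prediction can be reproduced with a sketch like the one below, assuming lobstr is installed and a typical 64-bit build of R. The range of lengths is an arbitrary choice for illustration.

```r
library(lobstr)

# Size in bytes of double vectors of length 0 to 20
lengths_to_try <- 0:20
sizes <- vapply(
  lengths_to_try,
  function(n) as.numeric(obj_size(numeric(n))),
  numeric(1)
)
print(data.frame(length = lengths_to_try, bytes = sizes))

# Prediction for one million doubles: empty-vector size plus 8 bytes each
predicted <- sizes[[1]] + 1e6 * 8
actual    <- as.numeric(obj_size(runif(1e6)))
print(c(predicted = predicted, actual = actual))
```

The table makes the irregular steps for small vectors visible, while the prediction for the large vector only needs the size of the empty vector and the 8 bytes per element.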
At y <- list(x, x) both list elements of y contain references to the same memory address, so no additional memory is required for the second list element. The list itself requires 64 bytes: 48 bytes for an empty list and 8 bytes for each element (obj_size(vector("list", 2))). This lets us predict 8000048 B + 64 B = 8000112 B.
The list y already contains references to x, so no extra memory is needed for x and the amount of required memory stays the same.
When we modify the first element of y[[1]], copy-on-modify occurs and the object will have the same size (8000048 bytes) and a new address in memory. So y's elements don't share references anymore. Because of this, their object sizes add up to the sum of the two different vectors and the length-2 list: 8000048 B + 8000048 B + 64 B = 16000160 B (16 MB).
The second element of y still references the same address as x, and therefore the amount of memory used for both x and y together stays the same.
When we modify the second element of y, this element will also point to a new memory address. This doesn't affect the memory size of the list.
However, as y doesn't share references with x anymore, the memory usage of the objects now adds up.
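The whole sequence can be walked through in one sketch, assuming lobstr is installed. The assignments are an illustrative reconstruction of the steps discussed above, and the variable names for the recorded sizes are made up for the example.

```r
library(lobstr)

x <- runif(1e6)
y <- list(x, x)

size_x    <- as.numeric(obj_size(x))
size_list <- as.numeric(obj_size(vector("list", 2)))  # bare length-2 list

size_y_shared  <- as.numeric(obj_size(y))     # one vector plus the list
size_xy_shared <- as.numeric(obj_size(x, y))  # x is already part of y

y[[1]][[1]] <- NA                             # copies the first element only
size_y_one_copy  <- as.numeric(obj_size(y))     # two vectors plus the list
size_xy_one_copy <- as.numeric(obj_size(x, y))  # x is still shared with y[[2]]

y[[2]][[1]] <- NA                             # copies the second element too
size_y_two_copies  <- as.numeric(obj_size(y))     # unchanged list size
size_xy_two_copies <- as.numeric(obj_size(x, y))  # nothing is shared any more

print(c(
  y_shared = size_y_shared, xy_shared = size_xy_shared,
  y_one_copy = size_y_one_copy, xy_one_copy = size_xy_one_copy,
  y_two_copies = size_y_two_copies, xy_two_copies = size_xy_two_copies
))
```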
Q: Explain why the following code doesn’t create a circular list.
A: In this situation, copy-on-modify prevents the creation of a circular list. Let's step through the details:
x <- list()  # creates initial object
obj_addr(x)
#> [1] "0x9792f40"
tracemem(x)
#> [1] "<0x9792f40>"
x[[1]] <- x  # copy-on-modify triggers new copy
#> tracemem[0x9792f40 -> 0x83d14f0]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> do.call eval eval eval eval eval.parent local
obj_addr(x)       # copied object has new memory address
#> [1] "0xced8bb8"
obj_addr(x[[1]])  # list element contains old memory address
#> [1] "0x9792f40"
Q: Wrap the two methods for subtracting medians into two functions, then use the bench package to carefully compare their speeds. How does performance change as the number of columns increase?
A: First, let’s define a function to create some random data and a function to subtract the median from each column.
We can then profile the performance by benchmarking subtract_medians() on data frame and list input for a specified number of columns. Both variants should take and return a data frame, so the list variant has to do a bit more work (converting to a list and back).
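The helpers described above might look like the following sketch. All function names, argument names and the row count of 1e4 are assumptions made for illustration, not necessarily the definitions used for the reported results; compare_speed() additionally requires the bench package when it is called.

```r
# Hypothetical helper: a data frame of uniform random numbers
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# Subtract one median per column; works for data frames and for lists
subtract_medians <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Benchmark data frame input against list input for a given column count.
# The list variant converts to a list first and back to a data frame after.
compare_speed <- function(ncol) {
  df_input <- create_random_df(nrow = 1e4, ncol = ncol)
  medians  <- vapply(df_input, median, numeric(1))
  bench::mark(
    "Data Frame" = subtract_medians(df_input, medians),
    "List" = as.data.frame(subtract_medians(as.list(df_input), medians))
  )
}

# Quick sanity check on a small input (does not need bench)
small    <- create_random_df(nrow = 100, ncol = 3)
centered <- subtract_medians(small, vapply(small, median, numeric(1)))
print(vapply(centered, median, numeric(1)))
```

Keeping a single subtraction function and varying only the input type means that any timing difference comes from the data structure rather than from the loop itself.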
The bench package allows us to run our benchmark across a grid of parameters easily. We will use it to slowly increase the number of columns containing random data.
results <- bench::press(
  ncol = c(1, 5, 10, 50, 100, 200, 400, 600, 800, 1000, 1500),
  compare_speed(ncol)
)

library(ggplot2)
ggplot(results, aes(ncol, median, col = expression)) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns of Input Data",
    y = "Computation Time",
    color = "Input Data Structure",
    title = "Benchmark: Median Subtraction"
  )
The execution time for median subtraction on data frame columns grows faster than linearly (roughly quadratically) with the number of columns in the input data. This is because each column assignment copies the data frame, so more columns mean both more copies and larger copies. For subtraction on list elements, the execution time increases only linearly.
For list input with fewer than ~ 800 columns, the cost of the additional data structure conversion is relatively large. For very wide data frames, the overhead of the additional copies slows down the computation considerably. Which function is faster therefore depends on the width of the data.
Q: What happens if you attempt to use tracemem() on an environment?
A: tracemem() cannot be used to mark and trace environments.
The error occurs because "it is not useful to trace NULL, environments, promises, weak references, or external pointer objects, as these are not duplicated" (see ?tracemem). Environments are always modified in place.
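Both points can be seen in a small base R sketch. The environment and the variable name value are made up for the example, and the exact wording of the error message may differ between R versions.

```r
e1 <- new.env()

# tracemem() refuses environments and signals an error
result  <- tryCatch(tracemem(e1), error = function(err) err)
errored <- inherits(result, "error")
if (errored) print(conditionMessage(result))

# Environments are modified in place: a change made through one name
# is visible through every other name bound to the same environment
e2 <- e1
e2$value <- 1
visible_via_e1 <- exists("value", envir = e1, inherits = FALSE)
print(visible_via_e1)
```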