2 Vectors

2.1 Atomic vectors

Q1: How do you create raw and complex scalars? (See ?raw and ?complex.)

A: In R, scalars are represented as vectors of length one. However, there’s no built-in syntax like there is for logicals, integers, doubles, and character vectors to create individual raw and complex values. Instead, you have to create them by calling a function.

For raw vectors you can use either as.raw() or charToRaw() to create them from numeric or character values.

as.raw(42)
#> [1] 2a
charToRaw("A")
#> [1] 41

In the case of complex numbers, real and imaginary parts may be provided directly to the complex() constructor.

complex(length.out = 1, real = 1, imaginary = 1)
#> [1] 1+1i

You can create purely imaginary numbers with a literal (e.g. 1i), but there is no literal syntax for complex numbers with a real part. These have to be built with + (e.g. 1i + 1) or with complex().
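As a quick check, the following sketch confirms that both routes give length-one vectors of the expected type. The variable names z1, z2 and r are only illustrative.

```r
# Two ways to build the same complex scalar
z1 <- complex(real = 1, imaginary = 1)
z2 <- 1 + 1i   # arithmetic on a double and an imaginary literal

typeof(z2)
#> [1] "complex"
z1 == z2
#> [1] TRUE

# A raw "scalar" is simply a raw vector of length one
r <- as.raw(42)
typeof(r)
#> [1] "raw"
length(r)
#> [1] 1
```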

Q2: Test your knowledge of vector coercion rules by predicting the output of the following uses of c():

c(1, FALSE)      # will be coerced to double    -> 1 0
c("a", 1)        # will be coerced to character -> "a" "1"
c(TRUE, 1L)      # will be coerced to integer   -> 1 1
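One way to verify these predictions, sketched here with typeof(), is to inspect the type of each result directly:

```r
typeof(c(1, FALSE))   # logical is promoted to double
#> [1] "double"
typeof(c("a", 1))     # double is promoted to character
#> [1] "character"
typeof(c(TRUE, 1L))   # logical is promoted to integer
#> [1] "integer"
```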

Q3: Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?

A: These comparisons are carried out by operator-functions (==, <), which coerce their arguments to a common type. In the examples above, these types will be character, double and character: 1 will be coerced to "1", FALSE is represented as 0 and 2 turns into "2". The last comparison is false because numbers precede letters in lexicographic order (although this may depend on the locale).
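The implicit coercions can be made visible by applying them by hand. This is only a sketch of what the operators do internally, and the result of the last comparison is not shown because it may depend on the locale.

```r
# 1 == "1": the number is converted to a string first
as.character(1)
#> [1] "1"
1 == "1"
#> [1] TRUE

# -1 < FALSE: the logical is converted to a number first
as.numeric(FALSE)
#> [1] 0
-1 < FALSE
#> [1] TRUE

# "one" < 2: the number is converted to a string, so this is a
# string comparison of "one" and "2"
"one" < as.character(2)
```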

Q4: Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).)

A: The presence of missing values shouldn’t affect the type of an object. Recall that coercion follows a type hierarchy: logical → integer → double → character, so logical is the most easily coerced type. When combining NAs with other atomic types, the NAs will be coerced to integer (NA_integer_), double (NA_real_) or character (NA_character_) and not the other way round. If NA were a character and added to a set of other values, all of these would be coerced to character as well.
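A short sketch illustrates this: a plain NA adapts to the type of its neighbours, whereas NA_character_ forces everything to character.

```r
typeof(NA)
#> [1] "logical"

# The logical NA is coerced to the type of the other elements
typeof(c(1L, NA))
#> [1] "integer"
typeof(c(1.5, NA))
#> [1] "double"
typeof(c("a", NA))
#> [1] "character"

# A character NA coerces the other elements instead
typeof(c(FALSE, NA_character_))
#> [1] "character"
```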

Q5: Precisely what do is.atomic(), is.numeric(), and is.vector() test for?

A: The documentation states that:

  • is.atomic() tests if an object is an atomic vector (as defined in Advanced R) or is NULL (!). Note that from R 4.4.0 onwards, is.atomic(NULL) returns FALSE.
  • is.numeric() tests if an object has type integer or double and is not of class factor, Date, POSIXt or difftime.
  • is.vector() tests if an object is a vector (as defined in Advanced R) or an expression and has no attributes, apart from names.

Atomic vectors are defined in Advanced R as objects of type logical, integer, double, complex, character or raw. Vectors are defined as atomic vectors or lists.
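The following sketch probes a few of the edge cases mentioned above. The attribute name "unit" is made up for illustration.

```r
is.atomic(1:3)
#> [1] TRUE
is.atomic(list(1, 2))
#> [1] FALSE

# Factors and dates are built on numeric types but are not "numeric"
is.numeric(1L)
#> [1] TRUE
is.numeric(factor("a"))
#> [1] FALSE
is.numeric(Sys.Date())
#> [1] FALSE

# is.vector() tolerates names but no other attributes
x <- c(a = 1, b = 2)
is.vector(x)
#> [1] TRUE
attr(x, "unit") <- "cm"   # hypothetical extra attribute
is.vector(x)
#> [1] FALSE

# Lists count as vectors, too
is.vector(list(1, 2))
#> [1] TRUE
```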

2.2 Attributes

Q1: How is setNames() implemented? How is unname() implemented? Read the source code.

A: setNames() is implemented as:

setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}

Because the data argument comes first, setNames() also works well with the magrittr pipe operator. When the first argument is omitted, object defaults to nm, so the result is a vector that uses nm for both its values and its names (this is rather unusual, as required arguments usually come first):

setNames( , c("a", "b", "c"))
#>   a   b   c 
#> "a" "b" "c"

unname() is implemented in the following way:

unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj))) 
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj))) 
    dimnames(obj) <- NULL
  obj
}

unname() removes existing names (or dimnames) by setting them to NULL.
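A brief usage sketch of both functions, with illustrative object names x and m:

```r
# setNames() attaches names and returns the object
x <- setNames(1:3, c("a", "b", "c"))
names(x)
#> [1] "a" "b" "c"

# unname() drops names from vectors ...
names(unname(x))
#> NULL

# ... and dimnames from matrices
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))
dimnames(unname(m))
#> NULL
```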

Q2: What does dim() return when applied to a 1-dimensional vector? When might you use NROW() or NCOL()?

A: From ?nrow:

dim() will return NULL when applied to a 1d vector.

One may want to use NROW() or NCOL() to handle atomic vectors, lists and NULL values in the same way as one-column matrices or data frames. For these objects nrow() and ncol() return NULL:

x <- 1:10

# Return NULL
nrow(x)
#> NULL
ncol(x)
#> NULL

# Pretend it's a column vector
NROW(x)
#> [1] 10
NCOL(x)
#> [1] 1

Q3: How would you describe the following three objects? What makes them different to 1:5?

x1 <- array(1:5, c(1, 1, 5))  # 1 row,  1 column,  5 in third dim.
x2 <- array(1:5, c(1, 5, 1))  # 1 row,  5 columns, 1 in third dim.
x3 <- array(1:5, c(5, 1, 1))  # 5 rows, 1 column,  1 in third dim.

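A: These are all “one dimensional” in the sense that only one of their three dimensions has an extent greater than one. If you imagine a 3d cube, x1 extends along the third dimension, x2 along the second (columns), and x3 along the first (rows). In contrast to 1:5, x1, x2 and x3 have a dim attribute.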
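The difference can be checked directly. In this sketch, removing the dim attribute with as.vector() recovers the plain integer vector:

```r
x1 <- array(1:5, c(1, 1, 5))

# A plain integer sequence carries no attributes
attributes(1:5)
#> NULL

# The array carries a dim attribute
dim(x1)
#> [1] 1 1 5

# Stripping the attribute gives back the original vector
identical(as.vector(x1), 1:5)
#> [1] TRUE
```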

Q4: An early draft used this code to illustrate structure():

structure(1:5, comment = "my attribute")
#> [1] 1 2 3 4 5

But when you print that object you don’t see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)

A: The documentation states (see ?comment):

Contrary to other attributes, the comment is not printed (by print or print.default).

Also, from ?attributes:

Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.

We can retrieve comment attributes by calling them explicitly:

foo <- structure(1:5, comment = "my attribute")

attributes(foo)
#> $comment
#> [1] "my attribute"
attr(foo, which = "comment")
#> [1] "my attribute"

2.3 S3 atomic vectors

Q1: What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?

A: table() returns a contingency table of its input variables. It is implemented as an integer vector with class table and dimensions (which makes it act like an array). Its attributes are dim (dimensions), dimnames (one set of names for each input variable) and class. Each tabulated variable adds one dimension, and the extent of each dimension corresponds to the number of unique values (factor levels) in that variable.

x <- table(mtcars[c("vs", "cyl", "am")])

typeof(x)
#> [1] "integer"
attributes(x)
#> $dim
#> [1] 2 3 2
#> 
#> $dimnames
#> $dimnames$vs
#> [1] "0" "1"
#> 
#> $dimnames$cyl
#> [1] "4" "6" "8"
#> 
#> $dimnames$am
#> [1] "0" "1"
#> 
#> 
#> $class
#> [1] "table"

# Subset x like it's an array
x[ , , 1]
#>    cyl
#> vs   4  6  8
#>   0  0  0 12
#>   1  3  4  0
x[ , , 2]
#>    cyl
#> vs  4 6 8
#>   0 1 3 2
#>   1 7 0 0
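To see how the dimensionality grows, one can tabulate one, two and three variables and count the dimensions. This is a small sketch using the same mtcars columns as above.

```r
# One dimension per tabulated variable
length(dim(table(mtcars["vs"])))
#> [1] 1
length(dim(table(mtcars[c("vs", "cyl")])))
#> [1] 2
length(dim(table(mtcars[c("vs", "cyl", "am")])))
#> [1] 3
```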

Q2: What happens to a factor when you modify its levels?

f1 <- factor(letters)
levels(f1) <- rev(levels(f1))

A: The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.

f1 <- factor(letters)
f1
#>  [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f1)
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26

levels(f1) <- rev(levels(f1))
f1
#>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f1)
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26

Q3: What does this code do? How do f2 and f3 differ from f1?

f2 <- rev(factor(letters))

f3 <- factor(letters, levels = rev(letters))

A: f2 and f3 each apply only one of two transformations: f2 reverses the order of the factor’s elements, while f3 reverses the order of its levels. For f1 both transformations occur.

# Reverse element order
(f2 <- rev(factor(letters)))
#>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f2)
#>  [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
#> [26]  1

# Reverse factor levels (when creating factor)
(f3 <- factor(letters, levels = rev(letters)))
#>  [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f3)
#>  [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
#> [26]  1
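The three factors can also be compared programmatically. This sketch checks the element labels and the levels separately, and it assumes that the default level order of factor(letters) is a to z, as in the output above.

```r
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))

# Element labels: reversed for f1 and f2, unchanged for f3
identical(as.character(f1), rev(letters))
#> [1] TRUE
identical(as.character(f2), rev(letters))
#> [1] TRUE
identical(as.character(f3), letters)
#> [1] TRUE

# Levels: reversed for f1 and f3, unchanged for f2
identical(levels(f1), rev(letters))
#> [1] TRUE
identical(levels(f2), letters)
#> [1] TRUE
identical(levels(f3), rev(letters))
#> [1] TRUE
```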

2.4 Lists

Q1: List all the ways that a list differs from an atomic vector.

A: To summarise:

  • Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types) as described in the introduction of the vectors chapter.

  • Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the vectors and the names and values chapters.)

    lobstr::ref(1:2)
    #> [1:0x7fcd936f6e80] <int>
    lobstr::ref(list(1:2, 2))
    #> █ [1:0x7fcd93d53048] <list> 
    #> ├─[2:0x7fcd91377e40] <int> 
    #> └─[3:0x7fcd93b41eb0] <dbl>
  • Subsetting with out-of-bounds and NA values leads to different output. For example, [ returns NA for atomics and NULL for lists. (This is described in more detail within the subsetting chapter.)

    # Subsetting atomic vectors
    (1:2)[3]
    #> [1] NA
    (1:2)[NA]
    #> [1] NA NA
    
    # Subsetting lists
    as.list(1:2)[3]
    #> [[1]]
    #> NULL
    as.list(1:2)[NA]
    #> [[1]]
    #> NULL
    #> 
    #> [[2]]
    #> NULL
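The first point, homogeneity versus heterogeneity, can be sketched as follows. The object name l is only illustrative.

```r
# A list keeps the type of each element
l <- list(1L, "a", TRUE)
typeof(l[[1]])
#> [1] "integer"
typeof(l[[2]])
#> [1] "character"
typeof(l[[3]])
#> [1] "logical"

# An atomic vector coerces everything to one common type
typeof(c(1L, "a", TRUE))
#> [1] "character"
```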

Q2: Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?

A: A list is already a vector, though not an atomic one!

Note that as.vector() and is.vector() use different definitions of “vector”!

is.vector(as.vector(mtcars))
#> [1] FALSE
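A minimal sketch of the difference on a simple list (the names l and v are illustrative):

```r
l <- list(1, 2, 3)

# as.vector() leaves the list as it is, because a list is already a vector
is.list(as.vector(l))
#> [1] TRUE

# unlist() flattens the list into an atomic vector
v <- unlist(l)
is.atomic(v)
#> [1] TRUE
typeof(v)
#> [1] "double"
```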

Q3: Compare and contrast c() and unlist() when combining a date and date-time into a single vector.

A: Date and date-time objects are both built upon doubles. While dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), date-time objects (POSIXct) store the time difference to this date in seconds.

date    <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# Internal representations
unclass(date)
#> [1] 1
unclass(dttm_ct)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"

As the c() generic only dispatches on its first argument, combining date and date-time objects via c() could lead to surprising results in older R versions (pre R 4.0.0):

# Output in R version 3.6.2
c(date, dttm_ct)  # equal to c.Date(date, dttm_ct) 
#> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date)  # equal to c.POSIXct(dttm_ct, date)
#> [1] "1970-01-01 02:00:00 CET" "1970-01-01 01:00:01 CET"

In the first statement above c.Date() is executed, which incorrectly treats the underlying double of dttm_ct (3600) as days instead of seconds. Conversely, when c.POSIXct() is called on a date, one day is counted as one second only.

We can highlight these mechanics by the following code:

# Output in R version 3.6.2
unclass(c(date, dttm_ct))  # internal representation
#> [1] 1 3600
date + 3599
#> [1] "1979-11-10"

As of R 4.0.0 these issues have been resolved and both methods now convert their input first into POSIXct and Date, respectively.

c(dttm_ct, date)
#> [1] "1970-01-01 01:00:00 UTC" "1970-01-02 00:00:00 UTC"
unclass(c(dttm_ct, date))
#> [1]  3600 86400

c(date, dttm_ct)
#> [1] "1970-01-02" "1970-01-01"
unclass(c(date, dttm_ct))
#> [1] 1 0

However, as c() strips the time zone (and other attributes) of POSIXct objects, some caution is still recommended.

(dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "HST"))
#> [1] "1970-01-01 01:00:00 HST"
attributes(c(dttm_ct))
#> $class
#> [1] "POSIXct" "POSIXt"

A package that deals with these kinds of problems in more depth and provides a structural solution for them is the {vctrs} package, which is also used throughout the tidyverse.

Let’s look at unlist(), which operates on list input.

# Attributes are stripped (dttm_ct is now 01:00 HST, i.e. 39600 seconds after the Epoch)
unlist(list(date, dttm_ct))  
#> [1]     1 39600

We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with when unlist() strips the attributes of the list.

To summarise: c() coerces types and strips time zones. Errors may have occurred in older R versions because of inappropriate method dispatch/immature methods. unlist() strips attributes.
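If only dates were unlisted, the class can be restored by hand. The sketch below assumes you know that the doubles represent days since 1970-01-01; the names x and restored are illustrative.

```r
date <- as.Date("1970-01-02")

# unlist() leaves only the underlying doubles
x <- unlist(list(date, date + 1))
class(x)
#> [1] "numeric"

# Reattach the Date class by supplying the origin explicitly
restored <- as.Date(x, origin = "1970-01-01")
restored
#> [1] "1970-01-02" "1970-01-03"
```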

2.5 Data frames and tibbles

Q1: Can you have a data frame with zero rows? What about zero columns?

A: Yes, you can create these data frames easily, either during creation or via subsetting. Both dimensions can even be zero at the same time.

Create a 0-row, 0-column, or an empty data frame directly:

data.frame(a = integer(), b = logical())
#> [1] a b
#> <0 rows> (or 0-length row.names)

data.frame(row.names = 1:3)  # or data.frame()[1:3, ]
#> data frame with 0 columns and 3 rows

data.frame()
#> data frame with 0 columns and 0 rows

Create similar data frames via subsetting the respective dimension with either 0, NULL, FALSE or a valid 0-length atomic (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. The following example uses a zero:

mtcars[0, ]
#>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
#> <0 rows> (or 0-length row.names)

mtcars[ , 0]  # or mtcars[0]
#> data frame with 0 columns and 32 rows

mtcars[0, 0]
#> data frame with 0 columns and 0 rows
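The resulting shapes can be confirmed with dim(), as in this short sketch:

```r
dim(mtcars[0, ])    # zero rows, all columns kept
#> [1]  0 11
dim(mtcars[, 0])    # all rows kept, zero columns
#> [1] 32  0
dim(data.frame())   # empty in both dimensions
#> [1] 0 0
```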

Q2: What happens if you attempt to set rownames that are not unique?

A: Matrices can have duplicated row names, so this does not cause problems.

Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via row.names(), you get an error:

data.frame(row.names = c("x", "y", "y"))
#> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y

df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
#> Warning: non-unique value when setting 'row.names': 'y'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed

If you use subsetting, [ automatically deduplicates:

row.names(df) <- c("x", "y", "z")
df[c(1, 1, 1), , drop = FALSE]
#>     x
#> x   1
#> x.1 1
#> x.2 1
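For comparison, the sketch below shows that a matrix accepts duplicated row names without complaint, and that make.unique() produces the same kind of suffixes as seen in the subsetting output above. The object name m is illustrative.

```r
# Matrices accept duplicated row names
m <- matrix(1:6, nrow = 3)
rownames(m) <- c("x", "y", "y")
rownames(m)
#> [1] "x" "y" "y"

# make.unique() appends numeric suffixes to repeated names
make.unique(c("x", "x", "x"))
#> [1] "x"   "x.1" "x.2"
```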

Q3: If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.

A: Both t(df) and t(t(df)) return matrices:

df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
#> [1] FALSE
is.matrix(t(df))
#> [1] TRUE
is.matrix(t(t(df)))
#> [1] TRUE

The dimensions will respect the typical transposition rules:

dim(df)
#> [1] 3 2
dim(t(df))
#> [1] 2 3
dim(t(t(df)))
#> [1] 3 2

Because the output is a matrix, every column is coerced to the same type. (This is implemented within t.data.frame() via as.matrix(), which is described below.)

df
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
t(df)
#>   [,1] [,2] [,3]
#> x "1"  "2"  "3" 
#> y "a"  "b"  "c"
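Trying different column types, as the question suggests, shows how the common type is chosen. The data frame names in this sketch are illustrative.

```r
# All-integer columns give an integer matrix
df_int <- data.frame(x = 1:3, y = 4:6)
typeof(t(df_int))
#> [1] "integer"

# Integer and double columns give a double matrix
df_dbl <- data.frame(x = 1:3, y = c(1.5, 2, 3))
typeof(t(df_dbl))
#> [1] "double"

# A single character column turns everything into character
df_chr <- data.frame(x = 1:3, y = letters[1:3])
typeof(t(df_chr))
#> [1] "character"

# Transposing twice does not give back the data frame
identical(t(t(df_int)), df_int)
#> [1] FALSE
```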

Q4: What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?

A: The type of the result of as.matrix() depends on the types of the input columns (see ?as.matrix):

The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.

On the other hand, data.matrix() will always return a numeric matrix (see ?data.matrix).

Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. […] Character columns are first converted to factors and then to integers.

We can illustrate and compare the mechanics of these functions using a concrete example. as.matrix() makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from data.matrix()’s output, we would need a lookup table for each column.

df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

as.matrix(df_coltypes)
#>      a   b       c   d     e   
#> [1,] "a" "TRUE"  "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#>      a b c   d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2
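The sketch below checks the resulting types and shows the kind of lookup that is needed to recover factor labels from data.matrix()’s output. The helper variable codes is illustrative.

```r
df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

typeof(as.matrix(df_coltypes))
#> [1] "character"
typeof(data.matrix(df_coltypes))
#> [1] "double"

# Recover the factor labels of column e via its levels
codes <- data.matrix(df_coltypes)[, "e"]
levels(df_coltypes$e)[codes]
#> [1] "f1" "f2"

# Without character or factor columns, as.matrix() follows the usual
# coercion hierarchy: logical and integer give an integer matrix
typeof(as.matrix(df_coltypes[c("b", "c")]))
#> [1] "integer"
```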