2 Vectors
2.1 Atomic vectors
Q1: How do you create raw and complex scalars? (See ?raw and ?complex.)
A: In R, scalars are represented as vectors of length one. However, there’s no built-in syntax like there is for logicals, integers, doubles, and character vectors to create individual raw and complex values. Instead, you have to create them by calling a function.
For raw vectors you can use either as.raw() or charToRaw() to create them from numeric or character values.
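A brief illustration of both functions (a minimal sketch; the input values and the variable names raw_num and raw_chr are arbitrary choices made for this example):

```r
# Create a raw scalar from a number (raw values print in hexadecimal)
raw_num <- as.raw(42)
raw_num
#> [1] 2a

# Create a raw scalar from a single character (its byte value)
raw_chr <- charToRaw("A")
raw_chr
#> [1] 41
```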
In the case of complex numbers, real and imaginary parts may be provided directly to the complex() constructor.
complex(length.out = 1, real = 1, imaginary = 1)
#> [1] 1+1i
You can create purely imaginary numbers with a literal (e.g. 1i), but there is no literal for a complex number with a real part; you have to combine the two parts with + (e.g. 1i + 1).
Q2: Test your knowledge of vector coercion rules by predicting the output of the following uses of c():
c(1, FALSE) # will be coerced to double -> 1 0
c("a", 1) # will be coerced to character -> "a" "1"
c(TRUE, 1L) # will be coerced to integer -> 1 1
Q3: Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?
A: These comparisons are carried out by operator functions (==, <), which coerce their arguments to a common type. In the examples above, these types will be character, double and character: 1 will be coerced to "1", FALSE is represented as 0 and 2 turns into "2" (numbers precede letters in lexicographic order, although this may depend on the locale).
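The three comparisons can be rewritten with the coercion made explicit. This is only a sketch of what the operators effectively do, not their actual implementation; the variable names are made up for this example, and the last result assumes a locale in which digits sort before letters:

```r
# 1 == "1": the double is coerced to character
cmp_chr <- as.character(1) == "1"
cmp_chr
#> [1] TRUE

# -1 < FALSE: the logical is coerced to double
cmp_dbl <- -1 < as.double(FALSE)
cmp_dbl
#> [1] TRUE

# "one" < 2: the double is coerced to character, then compared as strings
cmp_str <- "one" < as.character(2)
cmp_str
#> [1] FALSE
```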
Q4: Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).)
A: The presence of missing values shouldn’t affect the type of an object. Recall that there is a type-hierarchy for coercion from character → double → integer → logical. When combining NAs with other atomic types, the NAs will be coerced to integer (NA_integer_), double (NA_real_) or character (NA_character_) and not the other way round. If NA were a character and added to a set of other values all of these would be coerced to character as well.
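The following sketch shows the logical NA adapting to the type of its neighbours, and the typed NA from the hint forcing a coercion instead (the variable names are chosen for this example only):

```r
# The default NA is logical, the lowest type in the hierarchy
typeof(NA)
#> [1] "logical"

# Combined with other types, NA is coerced and the other values are kept as-is
na_int <- c(NA, 1L)
typeof(na_int)
#> [1] "integer"
na_dbl <- c(NA, 2.5)
typeof(na_dbl)
#> [1] "double"

# A character NA, in contrast, drags the other values up to character
na_chr <- c(FALSE, NA_character_)
na_chr
#> [1] "FALSE" NA
```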
Q5: Precisely what do is.atomic(), is.numeric(), and is.vector() test for?
A: The documentation states that:
- is.atomic() tests if an object is an atomic vector (as defined in Advanced R) or is NULL (!).
- is.numeric() tests if an object has type integer or double and is not of class factor, Date, POSIXt or difftime.
- is.vector() tests if an object is a vector (as defined in Advanced R) or an expression and has no attributes, apart from names.
Atomic vectors are defined in Advanced R as objects of type logical, integer, double, complex, character or raw. Vectors are defined as atomic vectors or lists.
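Two of these rules are easy to check interactively. The sketch below uses made-up objects (x, f and the "unit" attribute are illustrative only) to show that is.vector() tolerates names but no other attribute, and that is.numeric() rejects a factor even though it is built on integers:

```r
# Names are the only attribute is.vector() allows
x <- c(a = 1, b = 2)
is.vector(x)
#> [1] TRUE
attr(x, "unit") <- "cm"
is.vector(x)
#> [1] FALSE

# A factor has type integer, but is.numeric() excludes it by class
f <- factor("a")
typeof(f)
#> [1] "integer"
is.numeric(f)
#> [1] FALSE
```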
2.2 Attributes
Q1: How is setNames() implemented? How is unname() implemented? Read the source code.
A: setNames() is implemented as:
setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
Because the data argument comes first, setNames() also works well with the magrittr pipe operator. When no first argument is given, object defaults to nm, so the result is a vector named by its own values (this is rather untypical, as required arguments usually come first).
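For instance, supplying only nm yields a character vector whose names equal its values (a small sketch; the variable name self_named is ours):

```r
# object is missing, so it falls back to its default, nm
self_named <- setNames(nm = c("a", "b", "c"))
self_named
#>   a   b   c 
#> "a" "b" "c" 
```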
unname() is implemented in the following way:
unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj)))
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
    dimnames(obj) <- NULL
  obj
}
unname() removes existing names (or dimnames) by setting them to NULL.
Q2: What does dim() return when applied to a 1-dimensional vector? When might you use NROW() or NCOL()?
A: From ?nrow:
dim() will return NULL when applied to a 1d vector.
One may want to use NROW() or NCOL() to handle atomic vectors, lists and NULL values in the same way as one-column matrices or data frames. For these objects nrow() and ncol() return NULL:
x <- 1:10
# Return NULL
nrow(x)
#> NULL
ncol(x)
#> NULL
# Pretend it's a column vector
NROW(x)
#> [1] 10
NCOL(x)
#> [1] 1
Q3: How would you describe the following three objects? What makes them different to 1:5?
x1 <- array(1:5, c(1, 1, 5)) # 1 row, 1 column, 5 in third dim.
x2 <- array(1:5, c(1, 5, 1)) # 1 row, 5 columns, 1 in third dim.
x3 <- array(1:5, c(5, 1, 1)) # 5 rows, 1 column, 1 in third dim.
A: These are all “one dimensional.” If you imagine a 3d cube whose axes x, y and z correspond to the first, second and third dimension, x1 varies along the z-dimension, x2 along the y-dimension, and x3 along the x-dimension. In contrast to 1:5, x1, x2 and x3 have a dim attribute.
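The difference in attributes can be inspected directly; a minimal sketch using x1 from the question:

```r
x1 <- array(1:5, c(1, 1, 5))

# A plain integer sequence carries no attributes at all
attributes(1:5)
#> NULL

# The array stores the same values plus a dim attribute
attributes(x1)
#> $dim
#> [1] 1 1 5
```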
Q4: An early draft used this code to illustrate structure():
structure(1:5, comment = "my attribute")
#> [1] 1 2 3 4 5
But when you print that object you don’t see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)
A: The documentation states (see ?comment):
Contrary to other attributes, the comment is not printed (by print or print.default).
Also, from ?attributes:
Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.
We can retrieve comment attributes by calling them explicitly:
foo <- structure(1:5, comment = "my attribute")
attributes(foo)
#> $comment
#> [1] "my attribute"
attr(foo, which = "comment")
#> [1] "my attribute"
2.3 S3 atomic vectors
Q1: What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
A: table() returns a contingency table of its input variables. It is implemented as an integer vector with class table and dimensions (which makes it act like an array). Its attributes are dim (dimensions) and dimnames (one name for each input column). The dimensions correspond to the number of unique values (factor levels) in each input variable.
x <- table(mtcars[c("vs", "cyl", "am")])
typeof(x)
#> [1] "integer"
attributes(x)
#> $dim
#> [1] 2 3 2
#>
#> $dimnames
#> $dimnames$vs
#> [1] "0" "1"
#>
#> $dimnames$cyl
#> [1] "4" "6" "8"
#>
#> $dimnames$am
#> [1] "0" "1"
#>
#>
#> $class
#> [1] "table"
# Subset x like it's an array
x[ , , 1]
#>    cyl
#> vs   4  6  8
#>   0  0  0 12
#>   1  3  4  0
x[ , , 2]
#>    cyl
#> vs  4 6 8
#>   0 1 3 2
#>   1 7 0 0
Q2: What happens to a factor when you modify its levels?
A: The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.
f1 <- factor(letters)
f1
#> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f1)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26
levels(f1) <- rev(levels(f1))
f1
#> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f1)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26
Q3: What does this code do? How do f2 and f3 differ from f1?
A: For f2 and f3 either the order of the factor elements or the order of its levels is reversed: f2 reverses the elements but keeps the levels, while f3 reverses the levels but keeps the elements. For f1 both transformations occur.
# Reverse element order
(f2 <- rev(factor(letters)))
#> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f2)
#> [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
#> [26] 1
# Reverse factor levels (when creating factor)
(f3 <- factor(letters, levels = rev(letters)))
#> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f3)
#> [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
#> [26] 1
2.4 Lists
Q1: List all the ways that a list differs from an atomic vector.
A: To summarise:
- Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types) as described in the introduction of the vectors chapter.
- Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the vectors and the names and values chapters.)
- Subsetting with out-of-bounds and NA values leads to different output. For example, [ returns NA for atomics and NULL for lists. (This is described in more detail within the subsetting chapter.)
Q2: Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?
A: A list is already a vector, though not an atomic one!
Note that as.vector() and is.vector() use different definitions of “vector”!
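Both points can be seen in a short experiment (a sketch; the list l and the name flat are invented for this example, mtcars ships with R):

```r
l <- list(1, "a", TRUE)

# as.vector() leaves a list untouched, because a list already is a vector
is.list(as.vector(l))
#> [1] TRUE

# unlist() flattens the list and coerces the elements to a common atomic type
flat <- unlist(l)
flat
#> [1] "1"    "a"    "TRUE"

# A data frame passed through as.vector() still fails is.vector(),
# because it keeps attributes other than names
is.vector(as.vector(mtcars))
#> [1] FALSE
```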
Q3: Compare and contrast c() and unlist() when combining a date and date-time into a single vector.
A: Date and date-time objects are both built upon doubles. Dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), while date-time objects (POSIXct) store the number of seconds since that date.
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
# Internal representations
unclass(date)
#> [1] 1
unclass(dttm_ct)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"
As the c() generic only dispatches on its first argument, combining date and date-time objects via c() could lead to surprising results in older R versions (pre R 4.0.0):
# Output in R version 3.6.2
c(date, dttm_ct) # equal to c.Date(date, dttm_ct)
#> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date) # equal to c.POSIXct(dttm_ct, date)
#> [1] "1970-01-01 02:00:00 CET" "1970-01-01 01:00:01 CET"
In the first statement above c.Date() is executed, which incorrectly treats the underlying double of dttm_ct (3600) as days instead of seconds. Conversely, when c.POSIXct() is called on a date, one day is counted as one second only.
We can highlight these mechanics by the following code:
# Output in R version 3.6.2
unclass(c(date, dttm_ct)) # internal representation
#> [1] 1 3600
date + 3599
#> [1] "1979-11-10"
As of R 4.0.0 these issues have been resolved: c.POSIXct() and c.Date() now first convert their inputs into POSIXct and Date, respectively.
c(dttm_ct, date)
#> [1] "1970-01-01 01:00:00 UTC" "1970-01-02 00:00:00 UTC"
unclass(c(dttm_ct, date))
#> [1] 3600 86400
c(date, dttm_ct)
#> [1] "1970-01-02" "1970-01-01"
unclass(c(date, dttm_ct))
#> [1] 1 0
However, as c() strips the time zone (and other attributes) of POSIXct objects, some caution is still recommended.
(dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "HST"))
#> [1] "1970-01-01 01:00:00 HST"
attributes(c(dttm_ct))
#> $class
#> [1] "POSIXct" "POSIXt"
A package that deals with these kinds of problems in more depth and provides a structural solution for them is the {vctrs} package, which is also used throughout the tidyverse.
Let’s look at unlist(), which operates on list input.
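A minimal sketch of this, re-creating the date and the UTC date-time from the beginning of this answer so that the block is self-contained (the name flat is ours):

```r
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# unlist() drops the class and tzone attributes, leaving the bare doubles
flat <- unlist(list(date, dttm_ct))
flat
#> [1]    1 3600
attributes(flat)
#> NULL
```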
We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with when unlist() strips the attributes of the list.
To summarise: c() coerces types and strips time zones. Errors may have occurred in older R versions because of inappropriate method dispatch or immature methods. unlist() strips attributes.
2.5 Data frames and tibbles
Q1: Can you have a data frame with zero rows? What about zero columns?
A: Yes, you can create these data frames easily; either during creation or via subsetting. Even both dimensions can be zero.
Create a 0-row, 0-column, or an empty data frame directly:
data.frame(a = integer(), b = logical())
#> [1] a b
#> <0 rows> (or 0-length row.names)
data.frame(row.names = 1:3) # or data.frame()[1:3, ]
#> data frame with 0 columns and 3 rows
data.frame()
#> data frame with 0 columns and 0 rows
Create similar data frames via subsetting the respective dimension with either 0, NULL, FALSE or a valid 0-length atomic (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. The following example uses a zero:
mtcars[0, ]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)
mtcars[ , 0] # or mtcars[0]
#> data frame with 0 columns and 32 rows
mtcars[0, 0]
#> data frame with 0 columns and 0 rows
Q2: What happens if you attempt to set rownames that are not unique?
A: Matrices can have duplicated row names, so this does not cause problems.
Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via row.names(), you get an error:
data.frame(row.names = c("x", "y", "y"))
#> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
#> Warning: non-unique value when setting 'row.names': 'y'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
If you use subsetting, [ automatically deduplicates the row names by appending a numeric suffix.
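A small sketch of the deduplication, selecting the first row three times from a data frame with valid row names (the object name dup is ours):

```r
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "z")

# Repeating a row index would duplicate the row name "x";
# the subsetting method makes the names unique instead of failing
dup <- df[c(1, 1, 1), , drop = FALSE]
dup
#>     x
#> x   1
#> x.1 1
#> x.2 1
```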
Q3: If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.
A: Both of t(df) and t(t(df)) will return matrices:
df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
#> [1] FALSE
is.matrix(t(df))
#> [1] TRUE
is.matrix(t(t(df)))
#> [1] TRUE
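The dimensions will respect the typical transposition rules: transposing swaps the number of rows and columns, and transposing twice restores the original dimensions.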
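This can be verified with dim(), here using the same data frame as in the experiment above:

```r
df <- data.frame(x = 1:3, y = letters[1:3])

dim(df)        # 3 rows, 2 columns
#> [1] 3 2
dim(t(df))     # rows and columns are swapped
#> [1] 2 3
dim(t(t(df)))  # transposing twice restores the original shape
#> [1] 3 2
```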
Because the output is a matrix, every column is coerced to the same type. (It is implemented within t.data.frame() via as.matrix(), which is described below.)
df
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
t(df)
#>   [,1] [,2] [,3]
#> x "1"  "2"  "3"
#> y "a"  "b"  "c"
Q4: What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?
A: The type of the result of as.matrix() depends on the types of the input columns (see ?as.matrix):
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
On the other hand, data.matrix() will always return a numeric matrix (see ?data.matrix):
Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. […] Character columns are first converted to factors and then to integers.
We can illustrate and compare the mechanics of these functions using a concrete example. as.matrix() makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from data.matrix()’s output, we would need a lookup table for each column.
df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)
as.matrix(df_coltypes)
#>      a   b       c   d     e
#> [1,] "a" "TRUE"  "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#>      a b c   d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2