2 Vectors

2.1 Atomic vectors

1. Q: How do you create scalars of type raw and complex? (See ?raw and ?complex)

A: In R, scalars are represented as vectors of length one. For the raw and complex types, these can be created via raw() and complex():

raw(1)
#> [1] 00
complex(1)
#> [1] 0+0i

For raw vectors it’s easy to coerce numeric or character scalars to raw.

as.raw(42)
#> [1] 2a
charToRaw("A")
#> [1] 41

For complex numbers real and imaginary parts can be provided directly.

complex(length.out = 1, real = 1, imaginary = 1)
#> [1] 1+1i

2. Q: Test your knowledge of vector coercion rules by predicting the output of the following uses of c():

c(1, FALSE)      # will be coerced to numeric   -> 1 0
c("a", 1)        # will be coerced to character -> "a" "1"
c(TRUE, 1L)      # will be coerced to integer   -> 1 1

3. Q: Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?

A: These comparisons are carried out by operator functions (== and <), which coerce their arguments to a common type. In the examples above, the common types are character, double and character: 1 is coerced to "1", FALSE is represented as 0, and 2 becomes "2". The last comparison is false because strings are compared character by character and numerals precede letters (as in ASCII).
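The coercions can be made visible by writing them out by hand. The following is a small sketch, not part of the original answer; the vector name res is chosen here for illustration only.

```r
# Each comparison, followed by the explicit coercion it corresponds to
1 == "1"                      # compared as characters
as.character(1) == "1"

-1 < FALSE                    # compared as doubles
-1 < as.numeric(FALSE)

"one" < 2                     # compared as characters
"one" < as.character(2)

# Collect the three results (res is a name made up for this sketch)
res <- c(
  num_vs_chr = 1 == "1",
  num_vs_lgl = -1 < FALSE,
  chr_vs_num = "one" < 2
)
print(res)
```

String comparison depends on the collation of the current locale, but numerals sort before letters in the commonly used ones.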

4. Q: Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).)

A: The reason is practical. When NA is combined with other atomic types in c(), it is coerced, like TRUE and FALSE, to integer (NA_integer_), double (NA_real_), complex (NA_complex_) or character (NA_character_). Recall that R has a coercion hierarchy that goes logical < integer < double < character. If NA were a character, combining it with other values would coerce all of them to character as well. Making NA a logical means that including an NA in a dataset (which happens often) does not affect the type of the other values.
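A quick way to check this is to look at typeof() for a few combinations. This is an illustrative sketch; the variable names are made up for the example.

```r
# NA on its own is logical
typeof(NA)

# A logical NA adopts the type of whatever it is combined with
with_int <- c(NA, 1L)
with_dbl <- c(NA, 1.5)
with_chr <- c(NA, "a")
na_types <- c(typeof(with_int), typeof(with_dbl), typeof(with_chr))
print(na_types)

# A character NA, in contrast, drags everything up to character
forced_chr <- c(NA_character_, 1L)
typeof(forced_chr)
```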

5. Q: Precisely what do is.atomic(), is.numeric(), and is.vector() test for?

A: According to the documentation:
• is.atomic() tests if an object has one of these types: "logical", "integer", "double", "complex", "character", "raw" or "NULL" (!). (Since R 4.4.0, is.atomic(NULL) returns FALSE.)
• is.numeric() tests if an object has integer or double type and is not of "factor", "Date", "POSIXt" or "difftime" class.
• is.vector() tests if an object has no attributes other than names and if its mode() is atomic ("logical", "integer", "double", "complex", "character", "raw"), "list" or "expression".
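The three rules can be tried out on a few small objects. This is a sketch for illustration; my_attr is a made-up attribute name.

```r
# An integer vector carrying an extra (made-up) attribute
x <- structure(1:3, my_attr = "note")

is.atomic(x)           # still atomic: only the type matters
is.vector(x)           # FALSE: it has an attribute other than names
is.vector(c(a = 1))    # TRUE: names are allowed
is.vector(list(1, 2))  # TRUE: lists are vectors too

is.numeric(1L)            # TRUE: integer type
is.numeric(factor("a"))   # FALSE: integer type, but class factor
is.numeric(Sys.Date())    # FALSE: double type, but class Date
```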

2.2 Attributes

1. Q: How is setNames() implemented? How is unname() implemented? Read the source code.

A: setNames() is implemented as:

setNames
#> function (object = nm, nm)
#> {
#>     names(object) <- nm
#>     object
#> }
#> <bytecode: 0x4881f48>
#> <environment: namespace:stats>

Because the data comes first, setNames() also works well with the magrittr pipe operator. When the first argument is omitted, the names themselves are used as the data and the result is a named vector:

setNames( , c("a", "b", "c"))
#>   a   b   c
#> "a" "b" "c"

However, the implementation also means that setNames() only affects the names attribute and not any other, more specific naming attributes such as the dimnames attribute of matrices and arrays.
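A small matrix shows the consequence. This is a sketch for illustration; m and m_named are names chosen for the example.

```r
m <- matrix(1:4, nrow = 2)

# setNames() attaches a names attribute with one name per element ...
m_named <- setNames(m, c("a", "b", "c", "d"))
names(m_named)

# ... but leaves the dimnames (row and column names) untouched
dimnames(m_named)

# Row and column names need dimnames<- (or rownames<- / colnames<-)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2"))
m
```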

unname() is implemented in the following way:

unname
#> function (obj, force = FALSE)
#> {
#>     if (!is.null(names(obj)))
#>         names(obj) <- NULL
#>     if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
#>         dimnames(obj) <- NULL
#>     obj
#> }
#> <bytecode: 0x2dddbf8>
#> <environment: namespace:base>

If set, unname() removes the names and dimnames attributes from its input object. Note that removing the dimnames (column names and row names) does not work for data frames, even though the documentation (as of R 3.5.1) mentions this for the case where force = TRUE is supplied. Instead, the line dimnames(obj) <- NULL passes NULL as the replacement value to the underlying dimnames<-.data.frame method, which raises the first error condition in its source code:

`dimnames<-.data.frame`
#> function (x, value)
#> {
#>     d <- dim(x)
#>     if (!is.list(value) || length(value) != 2L)
#>         stop("invalid 'dimnames' given for data frame")
#>     value[[1L]] <- as.character(value[[1L]])
#>     value[[2L]] <- as.character(value[[2L]])
#>     if (d[[1L]] != length(value[[1L]]) || d[[2L]] != length(value[[2L]]))
#>         stop("invalid 'dimnames' given for data frame")
#>     row.names(x) <- value[[1L]]
#>     names(x) <- value[[2L]]
#>     x
#> }
#> <bytecode: 0x3a10418>
#> <environment: namespace:base>

2. Q: What does dim() return when applied to a 1d vector? When might you use NROW() or NCOL()?

A: From ?nrow:

dim() will return NULL when applied to a 1d vector.

One might want to use NROW() or NCOL() to handle atomic vectors, lists and NULL values in the same way as a one-column matrix or data frame (NULL is treated as having 1 column and 0 rows). In these cases the alternatives nrow() and ncol() return NULL (consistent with the behaviour of dim()). This can be convenient when subsetting data frames interactively, because the result is robust to the default drop = TRUE:

NROW(iris[, 1, drop = TRUE])
#> [1] 150
nrow(iris[, 1, drop = TRUE])
#> NULL
NCOL(iris[, 1, drop = TRUE])
#> [1] 1
ncol(iris[, 1, drop = TRUE])
#> NULL
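The same behaviour can be checked for plain vectors, lists and NULL. This is a minimal sketch; v and l are names made up for the example.

```r
v <- 1:5
l <- list(1, "a", TRUE)

# NROW()/NCOL() treat vectors and lists like a one-column matrix
c(NROW(v), NCOL(v))
c(NROW(l), NCOL(l))

# NULL counts as 0 rows and 1 column
c(NROW(NULL), NCOL(NULL))

# nrow()/ncol() just read dim(), which is NULL here
nrow(v)
ncol(v)
```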
3. Q: How would you describe the following three objects? What makes them different to 1:5?

x1 <- array(1:5, c(1, 1, 5))  # 1 row,  1 column,  5 in third dimension
x2 <- array(1:5, c(1, 5, 1))  # 1 row,  5 columns, 1 in third dimension
x3 <- array(1:5, c(5, 1, 1))  # 5 rows, 1 column,  1 in third dimension

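A: These are "one-dimensional" data stored in three-dimensional arrays. They are of class array and so, unlike 1:5, they have a dim attribute. The three objects differ only in which of the three dimensions holds the five values.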
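The difference is easy to inspect. The following sketch compares the dim attribute and the underlying data.

```r
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))

# Each array records its shape in the dim attribute
dim(x1)
dim(x2)
dim(x3)

# The plain vector has no attributes at all
dim(1:5)
attributes(1:5)

# Dropping the dim attribute recovers the plain vector
identical(as.vector(x1), 1:5)
```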

4. Q: An early draft used this code to illustrate structure():

structure(1:5, comment = "my attribute")
#> [1] 1 2 3 4 5

But when you print that object you don’t see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)

A: From the help of comment (?comment):

Contrary to other attributes, the comment is not printed (by print or print.default).

Also from the help of attributes (?attributes):

Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.

Apart from that, we can retrieve the attribute easily when we ask for it explicitly:

bla <- structure(1:5, comment = "my attribute")

attributes(bla)
#> $comment
#> [1] "my attribute"
attr(bla, "comment")
#> [1] "my attribute"

2.3 S3 atomic vectors

1. Q: What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?

A: table() returns a cross-tabulation of its input. The result is an S3 table object, which is an integer array under the hood (type integer, implicit class array). Its attributes are dim (the dimensions of the underlying array), dimnames (one entry per input variable) and class. Each tabulated variable adds one dimension, and the extent of each dimension equals the number of unique values (or factor levels) of the corresponding variable, i.e.:

dim(table(iris))
#> [1] 35 23 43 22  3
sapply(iris, function(x) length(unique(levels(as.factor(x)))))
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
#>           35           23           43           22            3
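A smaller example makes the growth in dimensionality easier to follow. This is a sketch with a made-up data frame df (columns g and h are hypothetical).

```r
df <- data.frame(
  g = c("a", "a", "b"),   # 2 unique values
  h = c("x", "y", "z")    # 3 unique values
)

tab1 <- table(df$g)   # one variable  -> 1 dimension
tab2 <- table(df)     # two variables -> 2 dimensions

dim(tab1)
dim(tab2)

# Type, class and attributes of the result
typeof(tab2)
class(tab2)
names(attributes(tab2))
names(dimnames(tab2))
```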
2. Q: What happens to a factor when you modify its levels?

f1 <- factor(letters)
levels(f1) <- rev(levels(f1))

A: Both the entries of the factor and its levels are reversed:

f1
#>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

3. Q: What does this code do? How do f2 and f3 differ from f1?

f2 <- rev(factor(letters)) # changes only the entries of the factor
f3 <- factor(letters, levels = rev(letters)) # changes only the levels of the factor

A: Unlike f1, f2 and f3 each change only one thing: f2 reverses the order of the entries, while f3 reverses the order of the levels. Neither changes both at the same time.
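Looking at the underlying integer codes explains all three cases. The sketch below uses three letters instead of the full alphabet to keep the output short, and assumes a locale in which "a", "b", "c" sort in that order.

```r
abc <- c("a", "b", "c")

f1 <- factor(abc)
levels(f1) <- rev(levels(f1))            # relabel levels, codes untouched
f2 <- rev(factor(abc))                   # reverse codes, levels untouched
f3 <- factor(abc, levels = rev(abc))     # reverse levels, entries keep labels

# Integer codes stored in each factor
as.integer(f1)   # 1 2 3
as.integer(f2)   # 3 2 1
as.integer(f3)   # 3 2 1

# Labels seen by the user
as.character(f1) # "c" "b" "a"
as.character(f2) # "c" "b" "a"
as.character(f3) # "a" "b" "c"
```

Modifying levels() only relabels the existing codes, which is why the entries of f1 appear reversed even though the stored integers did not change.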

2.4 Lists

1. Q: List all the ways that a list differs from an atomic vector.

A:

• Atomic vectors are homogeneous (all contents must be of the same type). Lists are heterogeneous (all contents can be of different types).

• An atomic vector points to a single address in memory, while a list contains one reference per element, each pointing to its own object:

lobstr::ref(1:3)
#> [1:0x2e961d0] <int>
lobstr::ref(list(1:3,2,3))
#> █ [1:0x61b6658] <list>
#> ├─[2:0x62078c0] <int>
#> ├─[3:0x2e75460] <dbl>
#> └─[4:0x2e75498] <dbl>

• Subsetting with out-of-bounds indices or NAs returns NAs for atomic vectors and NULL values for lists:

(1:2)[3]
#> [1] NA
as.list(1:2)[3]
#> [[1]]
#> NULL

(1:2)[NA]
#> [1] NA NA
as.list(1:2)[NA]
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL

2. Q: Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?

A: A list is already a vector, so as.vector() leaves it unchanged. To get rid of (flatten) the nested structure and obtain an atomic vector, unlist() is needed.
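A nested list illustrates the difference. This is a minimal sketch; the object names are chosen for the example.

```r
x <- list(1, list(2, 3))

# as.vector() has nothing to do: a list is already a vector
same <- as.vector(x)
is.list(same)
identical(same, x)

# unlist() flattens the nesting into an atomic vector
flat <- unlist(x)
is.atomic(flat)
flat
```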

3. Q: Compare and contrast c() and unlist() when combining a date and date-time into a single vector.

A: Date and date-time objects are built upon doubles. Dates are represented as days, while date-time objects (POSIXct) are represented as seconds, both counted from the reference date 1970-01-01, also known as “The Epoch”.

Let’s define date and date-time objects:

date    <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

When combining these objects, method dispatch leads to surprising output:

c(date, dttm_ct)  # equal to c.Date(date, dttm_ct)
#> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date)  # equal to c.POSIXct(dttm_ct, date)
#> [1] "1970-01-01 01:00:00 UTC" "1970-01-01 00:00:01 UTC"

The generic function dispatches based on the class of its first argument. When c.Date() is executed, dttm_ct is converted to a date, but the 3600 seconds are mistaken for 3600 days! When c.POSIXct() is called on date, one day is counted as only one second. The first case is illustrated by the following lines:

unclass(c(date, dttm_ct))  # internal representation
#> [1]    1 3600
date + 3599
#> [1] "1979-11-10"

Some of these problems may be avoided via explicit conversion of the classes:

c(as.Date(dttm_ct, tz = "UTC"), date)
#> [1] "1970-01-01" "1970-01-02"

Let’s look at unlist(), which operates on list input.

# attributes are stripped
unlist(list(date, dttm_ct))
#> [1]    1 3600

We see that, internally, dates and date-times are doubles. Unfortunately, this is all we are left with when unlist() strips the attributes of the list elements. (This wouldn’t happen for vector input, but then we would have to combine the different classes into one vector, which is tricky as seen above.)

To summarise: c() coerces types and errors may occur because of inappropriate method dispatch. unlist() strips attributes.
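The underlying numbers, and one way to avoid mixing units, can be sketched as follows. Converting both objects to POSIXct before combining is an approach chosen for this illustration, not something prescribed by the text above; both_ct is a made-up name.

```r
date    <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# Both classes are stored as doubles, but in different units
typeof(date)
as.numeric(date)      # days since 1970-01-01
as.numeric(dttm_ct)   # seconds since 1970-01-01 00:00 UTC

# Converting explicitly first puts both values on the seconds scale
both_ct <- c(as.POSIXct(date), dttm_ct)
as.numeric(both_ct)
```

After the explicit conversion the date contributes 86400 seconds (one day) instead of being counted as a single second.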

2.5 Data frames and tibbles

1. Q: Can you have a data frame with 0 rows? What about 0 columns?

A: Yes, you can create one easily and in many ways. Both dimensions can also be 0 at the same time. The fastest way is to subset the respective dimension with 0, NULL or a valid zero-length atomic vector (logical(0), character(0), integer(0), double(0)). A negative integer sequence that excludes every row or column would also work. Here we rely on the recycling rules for logical subsetting:

iris[FALSE,]
#> [1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species
#> <0 rows> (or 0-length row.names)

iris[ , FALSE] # or iris[FALSE]
#> data frame with 0 columns and 150 rows

iris[FALSE, FALSE] # or just data.frame()
#> data frame with 0 columns and 0 rows
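Some of the other subsetting variants mentioned above can be checked via dim(). This is a small sketch using the built-in iris data.

```r
# Zero rows, all columns kept
dim(iris[0, ])
dim(iris[integer(0), ])
dim(iris[-seq_len(nrow(iris)), ])   # negative indices dropping every row

# Zero columns, all rows kept
dim(iris[, character(0)])

# Zero rows and zero columns
dim(data.frame())
```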
2. Q: What happens if you attempt to set rownames that are not unique?

A: For matrices this works without any problems. For data frames it is not possible, and what happens depends on the approach. When using the row.names<- replacement function, no further arguments can be set and the underlying .rowNamesDF<- throws an error (and an additional warning):

row.names(mtcars) <- rep(1, nrow(mtcars))
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in .rowNamesDF<-(x, value = value): duplicate 'row.names' are not allowed

However, by calling .rowNamesDF<- directly one can set the make.names argument to NA or TRUE. When it is set to NA, any non-unique row name causes the new row names to become seq_len(nrow(x)). When make.names = TRUE, the row names are automatically converted into unique ones via make.names(value, unique = TRUE). The same behaviour occurs when a matrix with non-unique row names is converted into a data frame.
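The two make.names settings can be tried on a toy data frame. This is a sketch assuming R 3.5.0 or later, where .rowNamesDF<- is available; df, dup and the result names are made up for the example.

```r
df  <- data.frame(x = 1:3)
dup <- c("a", "a", "b")           # deliberately non-unique

# make.names = TRUE: duplicates are repaired via make.names(unique = TRUE)
df_unique <- `.rowNamesDF<-`(df, make.names = TRUE, value = dup)
rownames(df_unique)

# make.names = NA: fall back to automatic row names
df_seq <- `.rowNamesDF<-`(df, make.names = NA, value = dup)
rownames(df_seq)

# A matrix accepts the duplicated row names as they are
m <- matrix(1:3, dimnames = list(dup, "x"))
rownames(m)
```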

3. Q: If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.

A: Both return matrices whose dimensions follow the usual transposition rules. Because t() preprocesses its input with as.matrix.data.frame() before applying t.default(), and the elements of a matrix must all be of the same type, all elements are coerced in the usual order (logical < integer < double < character), while factors, dates and date-times are treated as characters during coercion.
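Two small experiments along these lines are sketched below; df_num and df_mix are made-up data frames.

```r
df_num <- data.frame(a = 1:3, b = c(1.5, 2, 3))       # integer + double
df_mix <- data.frame(a = 1:3, b = c("x", "y", "z"))   # integer + character

# Transposing swaps the dimensions and yields a matrix
dim(t(df_num))
is.matrix(t(df_num))

# The element type follows the coercion hierarchy
typeof(t(df_num))
typeof(t(df_mix))

# Transposing twice restores the shape, but the result stays a matrix
tt <- t(t(df_mix))
dim(tt)
is.data.frame(tt)
```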

4. Q: What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?

A: From ?as.matrix:

The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.

To illustrate this, we create an easy example where the data frame gets coerced to a character matrix:

a <- c("a", "b", "c")
b <- c(TRUE, FALSE, FALSE)
c <- c("TRUE", "FALSE", "FALSE")
d <- c(1L, 0L, 2L)
e <- c(1.5, 2, 3)
f <- c("one" = 1, "two" = 2, "three" = 3)
g <- c("first" = 1, "second" = 2, "third" = 3)
h <- factor(c("f1", "f2", "f3"))

df_cols <- data.frame(a, b, c, d, e, f, g, h, stringsAsFactors = FALSE)

# Note that format() is applied to the non-character columns, which can
# complicate the inverse conversion (back to the previous type).
# For example, TRUE in the b variable becomes " TRUE" (starting with a space).
as.matrix(df_cols)
#>       a   b       c       d   e     f   g   h
#> one   "a" " TRUE" "TRUE"  "1" "1.5" "1" "1" "f1"
#> two   "b" "FALSE" "FALSE" "0" "2.0" "2" "2" "f2"
#> three "c" "FALSE" "FALSE" "2" "3.0" "3" "3" "f3"

From ?data.matrix:

Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

So for data.matrix() we get a numeric matrix containing NAs for the original character columns:

data.matrix(df_cols)
#> Warning in data.matrix(df_cols): NAs introduced by coercion

#> Warning in data.matrix(df_cols): NAs introduced by coercion
#>        a b  c d   e f g h
#> one   NA 1 NA 1 1.5 1 1 1
#> two   NA 0 NA 0 2.0 2 2 2
#> three NA 0 NA 2 3.0 3 3 3
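The contrast is most visible for factor columns. The following sketch uses a made-up two-column data frame df_fac.

```r
df_fac <- data.frame(n = 1:2, f = factor(c("b", "a")))

m_chr <- as.matrix(df_fac)     # factor present -> character matrix
m_num <- data.matrix(df_fac)   # factor replaced by its internal codes

typeof(m_chr)
m_chr[, "f"]

is.numeric(m_num)
m_num[, "f"]   # "b" is the second level, "a" the first
```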

2.6 Old exercises

1. Q: What does dim() return when applied to a vector?
A: NULL

2. Q: If is.matrix(x) is TRUE, what will is.array(x) return?
A: TRUE, as also documented in ?array:

A two-dimensional array is the same thing as a matrix.

3. Q: What attributes does a data frame possess?
A: names, row.names and class.

4. Q: What are the six types of atomic vector? How does a list differ from an atomic vector?
A: The six types are logical, integer, double, character, complex and raw. The elements of a list don’t have to be of the same type.

5. Q: What makes is.vector() and is.numeric() fundamentally different to is.list() and is.character()?
A: The first two tests don’t check for a specific type.
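Several of these short answers can be confirmed directly at the console. This is a brief sketch; m and df_attr are names chosen for the example.

```r
# dim() of a plain vector
dim(1:3)

# Every matrix is also an array
m <- matrix(1:4, nrow = 2)
c(is.matrix(m), is.array(m))

# Attributes of a data frame
df_attr <- names(attributes(data.frame(a = 1)))
sort(df_attr)

# is.vector() is about attributes, not about one specific type
c(is.vector(1:3), is.vector("a"), is.vector(list()))
```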