3 Vectors

3.1 Atomic vectors

  1. Q: How do you create scalars of type raw and complex? (See ?raw and ?complex)

    A: In R, scalars are represented as vectors of length one. For the raw and complex types, such vectors can be created via raw() and complex().

    Raw vectors can easily be created from numeric or character values.

    For complex numbers real and imaginary parts may be provided directly.
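
    A minimal sketch of both cases follows. The variable names are chosen here for illustration only.

```r
# Raw scalars created from a numeric and from a character value
raw_num <- as.raw(42)
raw_chr <- charToRaw("A")

# Complex scalar created from its real and imaginary parts
cplx <- complex(length.out = 1, real = 1, imaginary = 1)

raw_num  # 2a
raw_chr  # 41
cplx     # 1+1i
```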

  2. Q: Test your knowledge of vector coercion rules by predicting the output of the following uses of c():

  3. Q: Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?

    A: These comparisons are carried out by operator functions, which coerce their arguments to a common type. In the examples above, the common types are character, double and character, respectively: 1 is coerced to "1", FALSE is represented as 0, and 2 turns into "2". The last comparison is false because numerals precede letters in lexicographic order (this may depend on the locale).
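
    The three comparisons can be checked directly. This is only a sketch, and the result of the last line depends on the collation order of the current locale.

```r
cmp_chr <- 1 == "1"     # 1 is coerced to "1", so "1" == "1"
cmp_lgl <- -1 < FALSE   # FALSE is coerced to 0, so -1 < 0
cmp_str <- "one" < 2    # 2 is coerced to "2"; strings are compared

cmp_chr  # TRUE
cmp_lgl  # TRUE
cmp_str  # FALSE in common locales, where digits sort before letters
```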

  4. Q: Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).)

    A: The presence of missing values shouldn't affect the type of an object. Recall that there is a type hierarchy for coercion: character >> double >> integer >> logical. When combining NAs with other atomic types, the NAs will be coerced to integer (NA_integer_), double (NA_real_) or character (NA_character_), and not the other way round. If NA were a character value and were added to a set of other values, all of these would be coerced to character as well.
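
    A short sketch illustrates how the logical NA adapts to its neighbours, while a typed NA such as NA_character_ pulls the other values up the hierarchy. The name na_types is used here only for illustration.

```r
na_types <- c(
  plain     = typeof(NA),                      # "logical"
  with_int  = typeof(c(NA, 1L)),               # "integer"
  with_dbl  = typeof(c(NA, 1.5)),              # "double"
  with_chr  = typeof(c(NA, "a")),              # "character"
  typed_chr = typeof(c(FALSE, NA_character_))  # "character"
)
na_types
```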

  5. Q: Precisely what do is.atomic(), is.numeric(), and is.vector() test for?

    A: The documentation states that:
    • is.atomic() tests if an object has one of these types: "logical", "integer", "double", "complex", "character", "raw" or "NULL" (!). Note that since R 4.4.0, is.atomic(NULL) returns FALSE.
    • is.numeric() tests if an object has integer or double type and is not of "factor", "Date", "POSIXt" or "difftime" class.
    • is.vector() tests if an object has no attributes other than names, and if its mode() is atomic ("logical", "integer", "double", "complex", "character", "raw"), "list" or "expression".
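
    A few experiments make these rules concrete. The objects below (named, tagged) and the attribute name my_attr are made up for this sketch.

```r
num_int  <- is.numeric(1L)            # TRUE: integer type
num_fct  <- is.numeric(factor("a"))   # FALSE: factors are excluded
num_date <- is.numeric(Sys.Date())    # FALSE: dates are excluded

named  <- c(a = 1, b = 2)
tagged <- structure(1:3, my_attr = "x")

vec_named  <- is.vector(named)         # TRUE: names are allowed
vec_tagged <- is.vector(tagged)        # FALSE: any other attribute is not
vec_list   <- is.vector(list(1, "a"))  # TRUE: lists are vectors too
```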

3.2 Attributes

  1. Q: How is setNames() implemented? How is unname() implemented? Read the source code.

    A: setNames() is implemented as:

    Because the data argument comes first, setNames() also works well with the magrittr pipe operator. When no first argument is given, the nm argument is used for both the values and the names, so the result is a named vector.

    setNames() only affects the names-attribute and ignores other more specific name-related attributes such as dimnames (for matrices and arrays).
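
    The behaviour described above can be sketched with a small re-implementation. Note that set_names_sketch() is a hypothetical stand-in written for this illustration; it is assumed to mirror stats::setNames(), but it is not guaranteed to match the current R source line for line.

```r
# Hypothetical stand-in: the object defaults to the names themselves
set_names_sketch <- function(object = nm, nm) {
  names(object) <- nm
  object
}

a <- set_names_sketch(1:3, c("a", "b", "c"))
b <- set_names_sketch(nm = c("x", "y"))  # values and names are both nm

# On a matrix only the names attribute is set; dimnames stay untouched
m  <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
m2 <- stats::setNames(m, letters[1:4])
dimnames(m2)
```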

    unname() is implemented in the following way:

    unname() removes existing names- and dimnames-attributes. By default the dimnames attribute (names and row names) won’t be affected for data frames.
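
    A brief sketch of the effect on a named vector and on a matrix with dimnames (both objects are made up for this example):

```r
v <- c(a = 1, b = 2)
m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))

v_bare <- unname(v)
m_bare <- unname(m)

names(v_bare)     # NULL
dimnames(m_bare)  # NULL
```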

  2. Q: What does dim() return when applied to a 1d vector? When might you use NROW() or NCOL()?

    A: From ?nrow:

    dim() will return NULL when applied to a 1d vector.

    One may want to use NROW() or NCOL() to handle atomic vectors, lists and NULL values in the same way as one-column matrices or data frames. For these objects, nrow() and ncol() return NULL. This situation may occur in interactive data analysis, e.g. when subsetting a data frame returns a plain vector.
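
    The difference between the lower-case and upper-case variants can be sketched as follows:

```r
x <- 1:3

dim(x)    # NULL
nrow(x)   # NULL
ncol(x)   # NULL

NROW(x)   # 3: the vector is treated like a one-column matrix
NCOL(x)   # 1

NROW(NULL)  # 0
NCOL(NULL)  # 1
```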

  3. Q: How would you describe the following three objects? What makes them different to 1:5?

    A: These objects are arrays rather than plain vectors. Unlike 1:5, they carry a dim attribute, which stores their dimensions.
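
    As a sketch, assume three arrays that hold the values 1:5 along different dimensions (the definitions below are illustrative assumptions):

```r
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))

dim(x1)          # 1 1 5
dim(x2)          # 1 5 1
dim(x3)          # 5 1 1
attributes(1:5)  # NULL: a plain vector has no dim attribute

# The underlying data are the same in all cases
identical(as.vector(x1), 1:5)
```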

  4. Q: An early draft used this code to illustrate structure():

    But when you print that object you don’t see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)

    A: The documentation states (see ?comment):

    Contrary to other attributes, the comment is not printed (by print or print.default).

    Also, from ?attributes:

    Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.

    We can retrieve comment attributes by calling them explicitly:
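
    A minimal sketch, assuming an object x created with structure() and an arbitrary comment string:

```r
x <- structure(1:5, comment = "my attribute")

x                    # the comment is not shown when printing
attr(x, "comment")   # "my attribute"
comment(x)           # "my attribute"
```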

3.3 S3 atomic vectors

  1. Q: What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?

    A: table() returns a contingency table of its input variables, which has the class "table". Internally it is represented as an array (implicit class) of integers (type) with the attributes dim (dimension of the underlying array) and dimnames (one name for each input column). The dimensions correspond to the number of unique values (factor levels) in each input variable.
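
    These properties can be inspected on a small example. The input vectors a and b are made up for this sketch.

```r
a <- c("x", "y", "x", "z")
b <- c("u", "u", "v", "v")

t1 <- table(a)
t2 <- table(a, b)

class(t1)              # "table"
typeof(t1)             # "integer"
names(attributes(t1))  # includes dim and dimnames

dim(t1)  # 3: three unique values in a
dim(t2)  # 3 2: one dimension per tabulated variable
```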

  2. Q: What happens to a factor when you modify its levels?

    A: The underlying integer values stay the same, but the levels are replaced, so both the printed elements of the factor and its levels appear reversed:
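
    A sketch of this effect, assuming a factor f1 built from letters:

```r
f1 <- factor(letters)
codes_before <- as.integer(f1)

levels(f1) <- rev(levels(f1))

identical(as.integer(f1), codes_before)  # TRUE: integer codes are unchanged
head(as.character(f1), 3)                # "z" "y" "x"
head(levels(f1), 3)                      # "z" "y" "x"
```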

  3. Q: What does this code do? How do f2 and f3 differ from f1?

    A: For f2 and f3, either the order of the factor elements or the order of its levels is reversed. For f1, both transformations occur.
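
    The following sketch assumes that f2 reverses the factor itself and f3 reverses only the levels (these definitions are assumptions made for illustration):

```r
f2 <- rev(factor(letters))                   # elements reversed
f3 <- factor(letters, levels = rev(letters)) # levels reversed

head(as.character(f2), 3)  # "z" "y" "x"
head(levels(f2), 3)        # "a" "b" "c"

head(as.character(f3), 3)  # "a" "b" "c"
head(levels(f3), 3)        # "z" "y" "x"
```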

3.4 Lists

  1. Q: List all the ways that a list differs from an atomic vector.

    A: To summarise:
    • Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types).
    • Atomic vectors point to one address in memory, while lists contain a separate reference for each element.
    • Subsetting with out-of-bounds values or NAs leads to NAs for atomics and NULL values for lists.
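
    The first and the last point can be sketched in a few lines (the objects are made up for this example):

```r
# Atomic vectors coerce their elements to one common type
atomic <- c(1, "a", TRUE)
typeof(atomic)                          # "character"

# Lists keep the type of every element
lst <- list(1, "a", TRUE)
vapply(lst, typeof, character(1))       # "double" "character" "logical"

# Out-of-bounds subsetting
oob_atomic <- c(1, 2)[3]      # NA
oob_list   <- list(1, 2)[3]   # a list holding a single NULL
```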

  2. Q: Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?

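    A: A list is also a vector, though not an atomic one! as.vector() therefore returns the list unchanged, whereas unlist() flattens it into an atomic vector.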
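
    A quick check, using a small example list:

```r
l <- list(1, 2, 3)

is.vector(l)             # TRUE: a list already is a vector
is.list(as.vector(l))    # TRUE: as.vector() leaves it a list

u <- unlist(l)
is.atomic(u)             # TRUE
u                        # 1 2 3
```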

  3. Q: Compare and contrast c() and unlist() when combining a date and date-time into a single vector.

    A: Date and date-time objects are built upon doubles. Dates are represented as days, while date-time objects (POSIXct) represent seconds (both counted from the reference date 1970-01-01, also known as “The Epoch”).

    When combining these objects method-dispatch leads to surprising output:

    The generic function dispatches based on the class of its first argument. When c.Date() is executed, dttm_ct is converted to a date, but the 3600 seconds are mistaken for 3600 days! When c.POSIXct() is called on date, one day counts as one second only, as illustrated by the following line:

    Some of these problems may be avoided via explicit conversion of the classes:

    Let’s look at unlist(), which operates on list input.

    We see that internally dates(-times) are stored as doubles. Unfortunately this is all we are left with, when unlist strips the attributes of the list.

    To summarise: c() coerces types and errors may occur because of inappropriate method dispatch. unlist() strips attributes.
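
    The underlying representation and the effect of unlist() can be sketched as follows. The values assigned to date and dttm_ct are assumptions chosen to match the 3600 seconds mentioned above. The result of mixing both classes in c() without conversion is left out, since it depends on the dispatched method.

```r
date    <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

as.numeric(date)     # 1: days since the Epoch
as.numeric(dttm_ct)  # 3600: seconds since the Epoch

# unlist() drops the classes and returns plain doubles
flat <- unlist(list(date, dttm_ct))
flat                 # 1 3600

# Explicit conversion to a common class before combining
both_dates <- c(date, as.Date(dttm_ct))
both_dates
```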

3.5 Data frames and tibbles

  1. Q: Can you have a data frame with 0 rows? What about 0 columns?

    A: Yes, you can create these data frames easily and in many ways. Both dimensions can even be 0. For example, you might subset the respective dimension with either 0, NULL or a valid 0-length atomic (logical(0), character(0), integer(0), double(0)). Negative integer sequences would also work. The following example uses the recycling rules for logical subsetting:

    Empty data frames can also be created directly (without subsetting):
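
    A sketch of several of these options, using a small data frame df that is made up for this example:

```r
df <- data.frame(x = 1:3, y = letters[1:3])

dim(df[FALSE, ])       # 0 2: FALSE is recycled over all rows
dim(df[, FALSE])       # 3 0: FALSE is recycled over all columns
dim(df[FALSE, FALSE])  # 0 0

dim(df[0, ])           # 0 2: subsetting with 0
dim(df[-(1:3), ])      # 0 2: negative integer sequence

dim(data.frame())      # 0 0: created directly
```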

  2. Q: What happens if you attempt to set rownames that are not unique?

    A: For matrices this will work without any problems. For data frames it is not possible, and what happens depends on the approach. When using the row.names<- replacement function, no further arguments can be set and the underlying .rowNamesDF<- will throw an error (and an additional warning):

    However, by calling .rowNamesDF<- directly one can set the make.names argument to NA or TRUE. When set to NA, any non-unique row name will trigger the new row names to become seq_len(nrow(x)). When make.names = TRUE, row names will automatically be converted into unique ones via make.names(value, unique = TRUE). The same behaviour occurs when a matrix with non-unique row names is converted into a data frame.
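
    The three cases can be sketched as follows. The data frame df and the duplicated names in dup are made up for this example; the error is caught with tryCatch() so that the sketch runs through.

```r
df  <- data.frame(x = 1:3)
dup <- c("a", "b", "b")

# row.names<- refuses duplicated values for data frames
res <- tryCatch(
  suppressWarnings({
    row.names(df) <- dup
    "ok"
  }),
  error = function(e) "error"
)
res  # "error"

# Matrices accept duplicated row names
m <- matrix(1:3, nrow = 3)
rownames(m) <- dup

# Calling .rowNamesDF<- directly with make.names = TRUE or NA
df_unique <- `.rowNamesDF<-`(df, make.names = TRUE, value = dup)
df_auto   <- `.rowNamesDF<-`(df, make.names = NA, value = dup)

row.names(df_unique)  # "a" "b" "b.1"
row.names(df_auto)    # "1" "2" "3"
```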

  3. Q: If df is a data frame, what can you say about t(df), and t(t(df))? Perform some experiments, making sure to try different column types.

    A: Both will return matrices whose dimensions follow the usual transposition rules. As t() uses as.matrix.data.frame() for preprocessing before applying t.default(), and elements of matrices need to be of the same type, all elements will be coerced in the usual order (logical << integer << double << character). Factors, dates and datetimes are treated as characters during coercion.
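
    A small experiment with two example data frames (both made up for this sketch):

```r
df  <- data.frame(x = 1:3, y = letters[1:3])
tdf <- t(df)

is.matrix(tdf)  # TRUE
dim(tdf)        # 2 3
typeof(tdf)     # "character": the character column wins
dim(t(tdf))     # 3 2

# With only integer and double columns the result is a double matrix
num_df <- data.frame(a = 1:2, b = c(1.5, 2.5))
typeof(t(num_df))  # "double"
```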

  4. Q: What does as.matrix() do when applied to a data frame with columns of different types? How does it differ from data.matrix()?

    A: From ?as.matrix:

    The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.

    Let's transform a dummy data frame into a character matrix. Note that format() is applied to non-character columns, which may complicate conversion back to the previous type. (For example, TRUE may be transformed to " TRUE", starting with a space.)

    From ?data.matrix:

    Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

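    data.matrix() returns a numeric matrix. In R versions before 4.0.0, character columns were replaced by missing values; newer versions first convert them to factors and then to their integer codes.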
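
    The contrast between both functions can be sketched on a small data frame with numeric and factor columns (df is made up for this example; character columns are left out because their treatment by data.matrix() depends on the R version):

```r
df <- data.frame(
  num = c(1.5, 2.5),
  int = 1:2,
  fct = factor(c("u", "v"))
)

chr_mat <- as.matrix(df)                   # the factor forces character
num_mat <- as.matrix(df[c("num", "int")])  # usual hierarchy: double
dat_mat <- data.matrix(df)                 # factor replaced by its codes

typeof(chr_mat)    # "character"
typeof(num_mat)    # "double"
dat_mat[, "fct"]   # 1 2
```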