knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(tibble.print_min = 4, tibble.print_max = 4)
library(magrittr)
library(readxl)

A common problem when reading data into R is having multiple header rows in the source excel file. Take a look at this messy excel file from sheet 2 in the clippy.xlsx example file, which ships with readxl:

readxl_example("clippy.xlsx") %>% 
    read_excel(sheet = 2)

Problem: The column names are right, but the first row of data is actually not data- it is additional metadata. This row of metadata is also causing all the columns to import as character. In reality, death is a datetime and weight is numeric.

Solution: We can use the read_excel() function to read in the same file twice. In Step 1, we’ll create a character vector of the column names only. In Step 2, we’ll read in the actual data and skip the multiple header rows at the top. When we do this, we lose the column names, so we use the character vector of column names we created in Step 1 instead.

Step 1

In this step, we read in the first row only (by setting n_max = 0), extract the names from that row (using the names() function), and assign those to a character vector called cnames. This object now contains the correct column names that we’ll need in Step 2.

(cnames <- readxl_example("clippy.xlsx") %>% 
    read_excel(sheet = 2, n_max = 0) %>% 
    names())

Step 2

Now we'll read in all the rows except for the first two rows (using skip = 2), which contained the variable names and variable descriptions, and set the column names to cnames, which we created in Step 1. Now our column types are guessed correctly.

(clippy2 <- readxl_example("clippy.xlsx") %>% 
  read_excel(sheet = 2, skip = 2, col_names = cnames))

If you want a way to save that metadata without polluting your actual data, you can do a third read using n_max = 1:

(clippy_meta <- readxl_example("clippy.xlsx") %>% 
  read_excel(sheet = 2, n_max = 1))

This vignette has been adapted from a post on Alison Hill's blog.



hadley/readxl documentation built on Oct. 15, 2023, 10:28 a.m.