vignettes/formulas-and-formatted-numbers.md

title: "Formulas and formatted numbers" author: "Jenny Bryan" date: "2018-06-28" output: rmarkdown::html_vignette: toc: true keep_md: true vignette: > %\VignetteIndexEntry{Formulas and formatted numbers} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc}

Preliminaries

Load googlesheets and dplyr, from which we use the %>% pipe operator and which gives us nicer printing of data frames (tbl_dfs)

library(googlesheets)
suppressMessages(library(dplyr))

TL;DR

To see how your data comes in as a data frame without numeric formatting, try this:

gs_read(..., literal = FALSE)

The googlesheets package comes with functions to access a public Sheet with formulas and formatted numbers. Visit it in the browser or check out this screenshot.

We use it to demo the effect of literal in gs_read(). First we accept the default, which is literal = TRUE.

gs_ff() %>% 
  gs_read(range = cell_cols("B:C"))
#> Accessing worksheet titled 'Sheet1'.
#> Parsed with column specification:
#> cols(
#>   number_formatted = col_character(),
#>   number_rounded = col_double()
#> )
#> # A tibble: 5 x 2
#>   number_formatted number_rounded
#>   <chr>                     <dbl>
#> 1 654,321                    1.23
#> 2 12.34%                     2.35
#> 3 1.23E+09                   3.46
#> 4 3 1/7                      4.57
#> 5 $0.36                      5.68

See the problem? Numeric formatting causes the first column to come in as character.

Try again with literal = FALSE:

gs_ff() %>% 
  gs_read(literal = FALSE, range = cell_cols("B:C"))
#> Accessing worksheet titled 'Sheet1'.
#> Parsed with column specification:
#> cols(
#>   number_formatted = col_double(),
#>   number_rounded = col_double()
#> )
#> # A tibble: 5 x 2
#>   number_formatted number_rounded
#>              <dbl>          <dbl>
#> 1       654321               1.23
#> 2            0.123           2.35
#> 3   1234567890               3.46
#> 4            3.14            4.57
#> 5            0.36            5.68

Fixed it! First column is numeric. And we've also gained precision in the second column, previously lost to rounding.

If you want full access to cell contents, use gs_read_cellfeed(..., literal = FALSE) to get a data frame with one per cell. Then take your pick from value, input_value, and numeric_value. Here's an example with lots of formulas:

gs_ff() %>% 
  gs_read_cellfeed(range = cell_cols("E")) %>% 
  select(-cell_alt, -row, -col) %>% 
  knitr::kable()
#> Accessing worksheet titled 'Sheet1'.

cell value input_value numeric_value

E1 formula formula NA E2 Google =HYPERLINK("http://www.google.com/","Google") NA E3 1,271,591.00 =sum(R[-1]C[-4]:R[3]C[-4]) 1271591.0 E4 =IMAGE("https://www.google.com/images/srpr/logo3w.png") NA E5 $A$1 =ADDRESS(1,1) NA E6 =SPARKLINE(R[-4]C[-4]:R[0]C[-4]) NA

Read on if you want to know more.

Different notions of cell contents

When working with Google Sheets via the cell feed, there are three ways to define cell contents:

Vocabulary: there's formatting and then there's formatting

Click on the Format menu in Google Sheets and you'll gain access to a "Number" sub-menu and ... lots of other stuff. Let's agree that "formatting" can mean two different things:

Decorative formatting is completely invisible to the Sheets API. It is also a terrible idea to encode data in decorative formatting, though it can be used to visually reinforce information that is properly stored in data (Google Sheets is capable of conditional formatting). Nothing in googlesheets or the rest of this vignette addresses decorative formatting. We shall not speak of it again. From now on, "formatting" means numeric formatting.

A worthy challenge

We've created a formula and formatting ~~nightmare~~ sampler Sheet. Go visit it in the browser!. Or check out this screenshot.

gs-test-formula-formatting-screenshot

It's one of the built-in example sheets. Access it with various functions that start with gs_ff.

Here's how it comes in as a data frame by default: you get "literal values" (suppressing a boring column in order to show the interesting ones).

gs_ff() %>% 
  gs_read() %>% 
  select(-integer)
#> Accessing worksheet titled 'Sheet1'.
#> Parsed with column specification:
#> cols(
#>   integer = col_double(),
#>   number_formatted = col_character(),
#>   number_rounded = col_double(),
#>   character = col_character(),
#>   formula = col_character(),
#>   formula_formatted = col_character()
#> )
#> # A tibble: 5 x 5
#>   number_formatted number_rounded character formula      formula_formatted
#>   <chr>                     <dbl> <chr>     <chr>        <chr>            
#> 1 654,321                    1.23 one       Google       3.18E+05         
#> 2 12.34%                     2.35 <NA>      1,271,591.00 52.63%           
#> 3 1.23E+09                   3.46 three     <NA>         0.22             
#> 4 3 1/7                      4.57 four      $A$1         123,456.00       
#> 5 $0.36                      5.68 five      <NA>         317,898

What if you want unformatted numbers? What if you want the actual formulas? You can now get them the cell feed, which, in googlesheets, means you must use gs_read_cellfeed(). You can cause gs_read() to consult the cell feed by specifying literal = FALSE.

The cell feed

Default methods of reading Sheet data assume that the data occupies a neat rectangle in the upper left corner, that you want all of it, and that you want the literal values.

What if you need more control over which cells? What if you want input or numeric values? Use the cell feed via gs_read_cellfeed(). Under the hood, gs_read() will use the cell feed whenever a cell range is provided, i.e. when the call is like gs_read(..., range = "B4:D9") or gs_read(..., range = cell_cols(4:6)), or when the new argument literal = FALSE.

gs_read_cellfeed() has been extended. As before, we return a data frame with one row per cell, but now we return all 3 notions of cell contents:

cf <- gs_read_cellfeed(gs_ff())
#> Accessing worksheet titled 'Sheet1'.

gs-test-formula-formatting-screenshot

cell value input_value numeric_value

A1 integer integer NA A2 123456 123456 123456.0 A3 345678 345678 345678.0 A4 234567 234567 234567.0 A6 567890 567890 567890.0 B1 number_formatted number_formatted NA B2 654,321 654321 654321.0 B3 12.34% 12.34% 0.1234 B4 1.23E+09 1234567890 1.23456789E9 B5 3 1/7 3.14159265359 3.14159265359 B6 \$0.36 0.36 0.36 C1 number_rounded number_rounded NA C2 1.23 1.2345 1.2345 C3 2.35 2.3456 2.3456 C4 3.46 3.4567 3.4567 C5 4.57 4.5678 4.5678 C6 5.68 5.6789 5.6789 D1 character character NA D2 one one NA D4 three three NA D5 four four NA D6 five five NA E1 formula formula NA E2 Google =HYPERLINK("http://www.google.com/","Google") NA E3 1,271,591.00 =sum(R[-1]C[-4]:R[3]C[-4]) 1271591.0 E4 =IMAGE("https://www.google.com/images/srpr/logo3w.png") NA E5 \$A\$1 =ADDRESS(1,1) NA E6 =SPARKLINE(R[-4]C[-4]:R[0]C[-4]) NA F1 formula_formatted formula_formatted NA F2 3.18E+05 =average(R[0]C[-5]:R[4]C[-5]) 317897.75 F3 52.63% =R[-1]C[-5]/R[1]C[-5] 0.5263144432081239 F4 0.22 =R[-2]C[-5]/R[2]C[-5] 0.21739421366813996 F5 123,456.00 =min(R[-3]C[-5]:R[1]C[-5]) 123456.0 F6 317,898 =average(R2C1:R6C1) 317897.75

gs-test-formula-formatting-screenshot

Exploration of cell contents

We explore the different cell contents for different variables. This motivates the logic behind what happens when gs_read(..., literal = FALSE) and gs_simply_cellfeed(..., literal = FALSE).

Formatted numbers

Column 2, number_formatted, holds variously formatted numbers. It is quite pathological, because in real life numeric formatting is likely to be uniform within a column, which helps readr make good decisions about type conversion.

cf %>%
  filter(row > 1, col == 2) %>%
  select(value, input_value, numeric_value) %>% 
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   value = col_character(),
#>   input_value = col_character(),
#>   numeric_value = col_double()
#> )
#> # A tibble: 5 x 3
#>   value    input_value   numeric_value
#>   <chr>    <chr>                 <dbl>
#> 1 654,321  654321              6.54e+5
#> 2 12.34%   12.34%              1.23e-1
#> 3 1.23E+09 1234567890          1.23e+9
#> 4 3 1/7    3.14159265359       3.14e+0
#> 5 $0.36    0.36                3.60e-1

Rounded numbers

Column 3, number_rounded, holds numbers with four decimal places, rounded to show just two. Here we want numeric_value.

cf %>%
  filter(row > 1, col == 3) %>%
  select(value, input_value, numeric_value) %>% 
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   value = col_double(),
#>   input_value = col_double(),
#>   numeric_value = col_double()
#> )
#> # A tibble: 5 x 3
#>   value input_value numeric_value
#>   <dbl>       <dbl>         <dbl>
#> 1  1.23        1.23          1.23
#> 2  2.35        2.35          2.35
#> 3  3.46        3.46          3.46
#> 4  4.57        4.57          4.57
#> 5  5.68        5.68          5.68

Formulas

Column 5, formula, holds various formulas, not necessarily numeric. Note we had to truncate input_value for printing purposes.

cf %>%
  filter(row > 1, col == 5) %>%
  select(value, input_value, numeric_value) %>% 
  mutate(input_value = substr(input_value, 1, 43)) %>% 
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   value = col_character(),
#>   input_value = col_character(),
#>   numeric_value = col_double()
#> )
#> # A tibble: 5 x 3
#>   value        input_value                                   numeric_value
#>   <chr>        <chr>                                                 <dbl>
#> 1 Google       "=HYPERLINK(\"http://www.google.com/\",\"Goo…            NA
#> 2 1,271,591.00 =sum(R[-1]C[-4]:R[3]C[-4])                          1271591
#> 3 <NA>         "=IMAGE(\"https://www.google.com/images/srpr…            NA
#> 4 $A$1         =ADDRESS(1,1)                                            NA
#> 5 <NA>         =SPARKLINE(R[-4]C[-4]:R[0]C[-4])                         NA

Numeric formulas, formatted

Column 6, formula_formatted, holds formatted numeric formulas:

cf %>%
  filter(row > 1, col == 6) %>%
  select(value, input_value, numeric_value) %>% 
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   value = col_character(),
#>   input_value = col_character(),
#>   numeric_value = col_double()
#> )
#> # A tibble: 5 x 3
#>   value      input_value                   numeric_value
#>   <chr>      <chr>                                 <dbl>
#> 1 3.18E+05   =average(R[0]C[-5]:R[4]C[-5])    317898.   
#> 2 52.63%     =R[-1]C[-5]/R[1]C[-5]                 0.526
#> 3 0.22       =R[-2]C[-5]/R[2]C[-5]                 0.217
#> 4 123,456.00 =min(R[-3]C[-5]:R[1]C[-5])       123456    
#> 5 317,898    =average(R2C1:R6C1)              317898.

Logic for cell contents when literal = FALSE

Based on the above examples (and more), here's the current logic for which cell contents are used in gs_read(..., literal = FALSE) and gs_reshape_cellfeed(..., literal = FALSE). The goal is to create an input that gives the desired result most often with default behavior of readr::type_convert(). If you think this is wrong, please discuss in an issue.



jennybc/googlesheets documentation built on Feb. 8, 2022, 11:48 p.m.