library(magrittr) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) file_col_specs <- "~/uk_companies_columns.csv" if (file.exists(file_col_specs)) file.remove(file_col_specs) # Delete previous col specs file rm(file_col_specs)
The import configuration tools are a collection of functions that help you importing your data sources.
First let's load the package:
library(graydon.package)
Whenever you need to import data in R, you'll probably want to rename columns so they adhere to the conventions. This can be a pain when maintaining your code, especially when you get consecutive versions of the data, where versions might slightly vary in the number of provided columns. The function apply_column_config
helps with this process by making it easy to rename columnsand in- or exclude columns from the data, by creating a column configuration csv with the following columns:
apply_column_config
function will supply this for you with best guesses, but feel free to edit this afterwards.TRUE
/FALSE
). Initially it is assumed you want to include all columns, but you can edit this.How this all works is best explained by giving examples of scenario's
When applying the apply_column_config
on a data source for the first time, it will create a corresponding source configuration .CSV in the same directory as the data source. The file will have the same base name, but is extended with _columns
and the extension .csv
. The source table column names will also be directly renamed to the new standard.
In this example the source data has already been pre-loaded in memory in the tbl_companies_uk data frame. The data source contains the following columns before processing
data.frame(`Column names` = names(tbl_companies_uk)) %>% knitr::kable()
The data frame was pre-loaded in memory, but for the example's sake, let's assume it has been loaded from the file ~/uk_companies.tsv
. We're going to apply the apply_column_config
the to data frame that was loaded as is (that is, without manipulation), and supplying it's original file source ~/uk_companies.tsv
. You can see that a message is printed saying it created a column spec file with the location ~/uk_companies_columns.csv
.
tbl_companies_uk <- apply_column_config(df_source = tbl_companies_uk, filename_source = "~/uk_companies.tsv")
If we now look at the column names again we see that the function allready did some guess work in cleaning the columns for us.
data.frame(`Column names` = names(tbl_companies_uk)) %>% knitr::kable()
The newly created, corresponding column configuration file now looks like this:
read.csv2("~/uk_companies_columns.csv") %>% knitr::kable()
Note how the column names have changed. All column names were converted to lower case letters, non-alphanumeric characters have been replaced with underscores and wherever the word code or date has been detected, corresponding prefixes have been added.
Although the automatically created column names might be an improvement, it is probably not optimal. Also you might want to have some columns excluded, which is not the default. You can change this yourself by editing the column config file yourself by changing the content of the new_name and include columns respectively.
Let's assume we want to change the column name of GDB_ORG_YUID
to id_graydon
and you want to exclude the column GB_CRO_REGISTRATION_NUMBER
, and the adress related lines; so you adjust the new_name and include columns respectively.
tbl_config <- read.csv2("~/uk_companies_columns.csv", stringsAsFactors = FALSE) # Change new column name idx_row <- (tbl_config$original_name == "GDB_ORG_YUID") idx_column <- which(names(tbl_config) == "new_name") tbl_config[idx_row, idx_column] <- "id_graydon" # Change column inclusion idx_row <- (tbl_config$original_name %in% c("GB_CRO_REGISTRATION_NUMBER", "ADDRESS_LINE_1_RA", "ADDRESS_LINE_2_RA", "ADDRESS_LINE_3_RA", "ADDRESS_LINE_4_RA", "ADDRESS_LINE_5_RA", "POSTCODE_RA", "S_EN_STREET_RA", "S_FR_CITY_RA")) idx_column <- which(names(tbl_config) == "include") tbl_config[idx_row, idx_column] <- "FALSE" write.csv2(tbl_config, "~/uk_companies_columns.csv", row.names = FALSE) rm(idx_row, idx_column) tbl_config %>% dplyr::filter(original_name %in% c("GDB_ORG_YUID", "GB_CRO_REGISTRATION_NUMBER", "ADDRESS_LINE_1_RA", "ADDRESS_LINE_2_RA", "ADDRESS_LINE_3_RA", "ADDRESS_LINE_4_RA", "ADDRESS_LINE_5_RA", "POSTCODE_RA", "S_EN_STREET_RA", "S_FR_CITY_RA")) %>% knitr::kable()
rm(tbl_config, tbl_companies_uk)
Now that the file contents has changed you can reapply it after reloading the source table:
tbl_companies_uk <- apply_column_config(df_source = tbl_companies_uk, filename_source = "~/uk_companies.tsv")
The message "Current column spec file fits current source ~/uk_companies_columns.csv"
tells you the column spec file could be applied to the source data frame. Below you can see the resulting columns of the data frame:
data.frame(`Column names` = names(tbl_companies_uk)) %>% knitr::kable()
rm(tbl_companies_uk)
During an analysis you sometimes get an updated or corrected version of data; and it can happen that new columns are added and/or some are removed. When applying the function apply_column_config
, it tries to keept most of original file specs, and adding/removing the newly added/removed columns. For example, the previous UK market file was redelivered without the address columns, but with two newly added columns: a two digit activity code ACTIVITY_CODE_2
and IS_ACTIVE_STATUS
.
# Remove address columns tbl_companies_uk %<>% dplyr::select(-c("ADDRESS_LINE_1_RA", "ADDRESS_LINE_2_RA", "ADDRESS_LINE_3_RA", "ADDRESS_LINE_4_RA", "ADDRESS_LINE_5_RA", "POSTCODE_RA", "S_EN_STREET_RA", "S_FR_CITY_RA")) # Add new columns tbl_companies_uk$ACTIVITY_CODE_2 <- stringr::str_sub(tbl_companies_uk$ACTIVITY_CODE, 1, 2) tbl_companies_uk$IS_ACTIVE_STATUS <- tbl_companies_uk$IS_ACTIVE_STATUS_CODE == "Y"
When you apply the apply_column_config
function to this new data source, it will return the message that the column spec file has been updated to reflect the new data source.
tbl_companies_uk <- apply_column_config(df_source = tbl_companies_uk, filename_source = "~/uk_companies.tsv")
As you can see the columns reflect the new data source and also the edits on column names of previous column specs are included.
data.frame(`Column names` = names(tbl_companies_uk)) %>% knitr::kable()
Somehow from the developers from readr library thought it was a cool idea to write the read_csv2 function for reading European style CSV-files (with ';' as a field- and ',' as a decimal delimiter), but they forgot to write the write_csv2 counterpart. The syntax is simple:
write_csv2(tbl_companies_uk, path="~/uk_companies_columns.csv")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.