load_table_from_file: Load a data file to a SQL table
In PHSKC-APDE/apde: Common functions used in APDE's work

load_table_from_file

R Documentation

Load a data file to a SQL table

Description

load_table_from_file loads a file data to a SQL table using specified variables or a YAML config file.

Usage

load_table_from_file(
  conn,
  server = NULL,
  overall = T,
  ind_yr = F,
  years = NULL,
  combine_yr = F,
  config = NULL,
  config_url = NULL,
  config_file = NULL,
  to_schema = NULL,
  to_table = NULL,
  server_path = NULL,
  db_name = NULL,
  azure = F,
  azure_uid = NULL,
  azure_pwd = NULL,
  file_path = NULL,
  field_term = NULL,
  row_term = NULL,
  first_row = 2,
  truncate = T,
  drop_index = T,
  tablock = F,
  test_schema = NULL,
  use_sys = F
)

Arguments

`conn`	SQL server connection created using `odbc` package.
`server`	Name of server being used (only applies if using a YAML file). Useful if the same table is loaded to multiple servers but with different names or schema. Note that this is different from the `server_path` argument that is used as part of the BCP command; the `server` argument can be any name a user wants whereas `server_path` must be the actual server name.
`overall`	Load a single table instead of a table for each calendar year. Mutually exclusive with `ind_yr` option. Default is `TRUE`.
`ind_yr`	Load multiple tables, one for each calendar year, with a year suffix on each table name (e.g., mcaid_elig_2014). Mutually exclusive with `overall` option. If using this option, the list of years should be provided via the `years` argument or a `years` variable in the YAML file. Default is `FALSE`.
`years`	Vector of individual years to make tables for (if not using YAML input).
`combine_yr`	Union year-specific files into a single table. Only applies if ind_yr = T. Default is `FALSE`.
`config`	Name of object in global environment that contains configuration information. Use one of `config`, `config_url`, or `config_file`. Should be in a YAML format with at least the following variables: `to_schema`, `to_table`, `server_path`, `db_name`, and `file_path`, with possibly `field_term`, `row_term`, and `first_row`. `to_schema` and `to_table`, `server_path`, and `db_name` should all be nested under the server name if applicable, other variables should not (but might be nested under a calendar year).
`config_url`	URL of a YAML config file. Use one of `config`, `config_url`, or `config_file`. Note the requirements under `config`.
`config_file`	File path of a YAML config file. Use one of `config`, `config_url`, or `config_file`. Note the requirements under `config`.
`to_schema`	Name of the schema that data will be loaded to (if not using YAML input).
`to_table`	Name of the table that data will be loaded to (if not using YAML input).
`server_path`	Name of the SQL server to connect to (if not using YAML input). If using Azure, only seems to work if you specify an existing DSN connection.
`db_name`	Name of the database to use (if not using YAML input).
`azure`	Flag to indicate data are being loaded to an Azure SQL server. Default is `FALSE`.
`azure_uid`	Username for connecting to Azure Active Directory. Only use if azure = T.
`azure_pwd`	Password for connecting to Azure Active Directory. Only use if azure = T.
`file_path`	File path of data to be loaded (if not using YAML input). If ind_yr = T, this should be a named vector with the format `c("2014" = "//path1/folder1/file1.ext", "2015" = "//path1/folder1/file2.ext")` where the name matches the years to load.
`field_term`	Field terminator in the data (if not using YAML input). If using ind_yr = T and the terminator differs between calendar years, this should be a named vector with the format `c("overall" = "\t", "2014" = "\\|", "2016" = "\0")` where "overall" supplies the default terminator and the other names match the years that differ. Do not use a named vector if overall = T or if there is no variation between years. The BCP default is `\t`.
`row_term`	Row terminator in the data (if not using YAML input). If using ind_yr = T and the terminator differs between calendar years, this should be a named vector with the format `c("overall" = "\n", "2014" = "\r", "2016" = "\0")` where "overall" supplies the default terminator and the other names match the years that differ. Do not use a named vector if overall = T or if there is no variation between years. The BCP default is `\n`.
`first_row`	Row number of the first line of data (if not using YAML input). Default is `2` (assumes a header row). Currently this must be the same for all years.
`truncate`	Truncate existing table prior to loading. Default is `TRUE`.
`drop_index`	Drop any existing indices prior to loading data. This can speed loading times substantially. Use `add_index` to restore the index after. Default is `TRUE`.
`tablock`	Logical (`TRUE` \| `FALSE`). Lock the entire table for duration of loading process to improve performance. Default is `FALSE`.
`test_schema`	Write to a temporary/development schema when testing out table creation. Will use the to_schema (specified or in the YAML file) to make a new table name of {to_schema}_{to_table}. Schema must already exist in the database. Most useful when the user has an existing YAML file and does not want to overwrite it. Only 1,000 rows will be loaded to each table. Default is `NULL`.
`use_sys`	If the sys package is installed, use this to call BCP and see a more informative interface in the R console. Helpful for debugging when BSP doesn't work. Default is `FALSE`.

Details

This function loads a data file to an already existing SQL table using specified variables or a YAML configuration file. The function is essentially a wrapper for the SQL bulk copy program (BCP) utility using integrated authorization. Users can specify some input functions (e.g., to_table) and rely on the config file for the rest of the necessary information. For all arguments that could be specified or come from a YAML file, the hierarchy is specified > argument under server in YAML > argument not under server in YAML. Note that arguments that should not vary between servers (e.g., row_terminator) should not be listed under a server in the YAML file. Note: This function does not work to load to an Azure SQL DB (as of May 2022)

Example YAML file with no server or individual years

(Assume the indentation is appropriate)

to_schema: raw
to_table: mcaid_elig
*optional other components like a qa_schema and qa_table, index name, vars, etc.*
server_path: KCITSQLABCD51
db_name: PHClaims
file_path: //path123/importdata/Data/kc_elig_20210519.txt
field_term: \t
row_term: \n

Example YAML file with servers (phclaims, hhsaw) and individual years

(Assume the indentation is appropriate)

phclaims:
    to_schema: raw
    to_table: mcaid_elig
    server_path: KCITSQLABCD51
    db_name: PHClaims
hhsaw:
    to_schema: raw
    to_table: mciad_elig
    server_path: kcitazxyz20.database.windows.net
    db_name: hhs_analytics_workspace
*optional other components like a qa_schema and qa_table, index name, vars, etc.*
field_term: \t
row_term: \n
years:
    2014
    2015
    2016
2014:
    file_path: //path123/importdata/Data/kc_elig_2014.txt
    field_term: \|
    row_term: \r
2015:
    file_path: //path123/importdata/Data/kc_elig_2015.txt  
2016:
    file_path: //path123/importdata/Data/kc_elig_2016.txt
    field_term: \0
    row_term: \0

Examples

## Not run: 
load_table(conn = db_claims, server = "hhsaw", config = load_config)
load_table(conn = db_claims, server = "phclaims", 
config_url = "https://raw.githubusercontent.com/PHSKC-APDE/claims_data/load_mcaid_raw.R",
overall = F, ind_yr = T)

## End(Not run)

PHSKC-APDE/apde documentation built on April 14, 2025, 10:46 a.m.