load_table_from_file: Load a data file to a SQL table

View source: R/load_table_from_file.R

load_table_from_fileR Documentation

Load a data file to a SQL table

Description

load_table_from_file loads a file data to a SQL table using specified variables or a YAML config file.

Usage

load_table_from_file(
  conn,
  server = NULL,
  overall = T,
  ind_yr = F,
  years = NULL,
  combine_yr = F,
  config = NULL,
  config_url = NULL,
  config_file = NULL,
  to_schema = NULL,
  to_table = NULL,
  server_path = NULL,
  db_name = NULL,
  azure = F,
  azure_uid = NULL,
  azure_pwd = NULL,
  file_path = NULL,
  field_term = NULL,
  row_term = NULL,
  first_row = 2,
  truncate = T,
  drop_index = T,
  tablock = F,
  test_schema = NULL,
  use_sys = F
)

Arguments

conn

SQL server connection created using odbc package.

server

Name of server being used (only applies if using a YAML file). Useful if the same table is loaded to multiple servers but with different names or schema. Note that this is different from the server_path argument that is used as part of the BCP command; the server argument can be any name a user wants whereas server_path must be the actual server name.

overall

Load a single table instead of a table for each calendar year. Mutually exclusive with ind_yr option. Default is TRUE.

ind_yr

Load multiple tables, one for each calendar year, with a year suffix on each table name (e.g., mcaid_elig_2014). Mutually exclusive with overall option. If using this option, the list of years should be provided via the years argument or a years variable in the YAML file. Default is FALSE.

years

Vector of individual years to make tables for (if not using YAML input).

combine_yr

Union year-specific files into a single table. Only applies if ind_yr = T. Default is FALSE.

config

Name of object in global environment that contains configuration information. Use one of config, config_url, or config_file. Should be in a YAML format with at least the following variables: to_schema, to_table, server_path, db_name, and file_path, with possibly field_term, row_term, and first_row. to_schema and to_table, server_path, and db_name should all be nested under the server name if applicable, other variables should not (but might be nested under a calendar year).

config_url

URL of a YAML config file. Use one of config, config_url, or config_file. Note the requirements under config.

config_file

File path of a YAML config file. Use one of config, config_url, or config_file. Note the requirements under config.

to_schema

Name of the schema that data will be loaded to (if not using YAML input).

to_table

Name of the table that data will be loaded to (if not using YAML input).

server_path

Name of the SQL server to connect to (if not using YAML input). If using Azure, only seems to work if you specify an existing DSN connection.

db_name

Name of the database to use (if not using YAML input).

azure

Flag to indicate data are being loaded to an Azure SQL server. Default is FALSE.

azure_uid

Username for connecting to Azure Active Directory. Only use if azure = T.

azure_pwd

Password for connecting to Azure Active Directory. Only use if azure = T.

file_path

File path of data to be loaded (if not using YAML input). If ind_yr = T, this should be a named vector with the format c("2014" = "//path1/folder1/file1.ext", "2015" = "//path1/folder1/file2.ext") where the name matches the years to load.

field_term

Field terminator in the data (if not using YAML input). If using ind_yr = T and the terminator differs between calendar years, this should be a named vector with the format c("overall" = "\t", "2014" = "\|", "2016" = "\0") where "overall" supplies the default terminator and the other names match the years that differ. Do not use a named vector if overall = T or if there is no variation between years. The BCP default is \t.

row_term

Row terminator in the data (if not using YAML input). If using ind_yr = T and the terminator differs between calendar years, this should be a named vector with the format c("overall" = "\n", "2014" = "\r", "2016" = "\0") where "overall" supplies the default terminator and the other names match the years that differ. Do not use a named vector if overall = T or if there is no variation between years. The BCP default is \n.

first_row

Row number of the first line of data (if not using YAML input). Default is 2 (assumes a header row). Currently this must be the same for all years.

truncate

Truncate existing table prior to loading. Default is TRUE.

drop_index

Drop any existing indices prior to loading data. This can speed loading times substantially. Use add_index to restore the index after. Default is TRUE.

tablock

Logical (TRUE | FALSE). Lock the entire table for duration of loading process to improve performance. Default is FALSE.

test_schema

Write to a temporary/development schema when testing out table creation. Will use the to_schema (specified or in the YAML file) to make a new table name of {to_schema}_{to_table}. Schema must already exist in the database. Most useful when the user has an existing YAML file and does not want to overwrite it. Only 1,000 rows will be loaded to each table. Default is NULL.

use_sys

If the sys package is installed, use this to call BCP and see a more informative interface in the R console. Helpful for debugging when BSP doesn't work. Default is FALSE.

Details

This function loads a data file to an already existing SQL table using specified variables or a YAML configuration file. The function is essentially a wrapper for the SQL bulk copy program (BCP) utility using integrated authorization. Users can specify some input functions (e.g., to_table) and rely on the config file for the rest of the necessary information. For all arguments that could be specified or come from a YAML file, the hierarchy is specified > argument under server in YAML > argument not under server in YAML. Note that arguments that should not vary between servers (e.g., row_terminator) should not be listed under a server in the YAML file. Note: This function does not work to load to an Azure SQL DB (as of May 2022)

Example YAML file with no server or individual years

(Assume the indentation is appropriate)

to_schema: raw
to_table: mcaid_elig
*optional other components like a qa_schema and qa_table, index name, vars, etc.*
server_path: KCITSQLABCD51
db_name: PHClaims
file_path: //path123/importdata/Data/kc_elig_20210519.txt
field_term: \t
row_term: \n

Example YAML file with servers (phclaims, hhsaw) and individual years

(Assume the indentation is appropriate)

phclaims:
    to_schema: raw
    to_table: mcaid_elig
    server_path: KCITSQLABCD51
    db_name: PHClaims
hhsaw:
    to_schema: raw
    to_table: mciad_elig
    server_path: kcitazxyz20.database.windows.net
    db_name: hhs_analytics_workspace
*optional other components like a qa_schema and qa_table, index name, vars, etc.*
field_term: \t
row_term: \n
years:
    2014
    2015
    2016
2014:
    file_path: //path123/importdata/Data/kc_elig_2014.txt
    field_term: \|
    row_term: \r
2015:
    file_path: //path123/importdata/Data/kc_elig_2015.txt  
2016:
    file_path: //path123/importdata/Data/kc_elig_2016.txt
    field_term: \0
    row_term: \0

Examples

## Not run: 
load_table(conn = db_claims, server = "hhsaw", config = load_config)
load_table(conn = db_claims, server = "phclaims", 
config_url = "https://raw.githubusercontent.com/PHSKC-APDE/claims_data/load_mcaid_raw.R",
overall = F, ind_yr = T)

## End(Not run)


PHSKC-APDE/apde documentation built on April 14, 2025, 10:46 a.m.