README.md

dbautojoinr

Tutorial

Installation

if (!("devtools" %in% rownames(installed.packages()))) install.packages("devtools", repos = "https://cloud.r-project.org")
library(devtools)
if (!("dbautojoinr" %in% rownames(installed.packages()))) install_github("N1h1l1sT/dbautojoinr", upgrade = FALSE)
library(dbautojoinr)

Note:

There's a working demo playing out a use-case, complete with the SQL Database and R code used to auto-join the SQL Database. You can either skip to the written demo below, or click this link to watch the demo on YouTube, or continue reading below for the how-to-use general tutorial

Known issues

Initialisation

Before one can use the automatic join functions, initialisation has to occur so that: A connection to the SQL Database is established The "db_fields" Dataframe is returned so that the user can select which columns they want to include. The whole process revolves around the db_fields DF, which can be configured by the user in a user-friendly way using mouse clicks to select SQL fields.

library(dbautojoinr)

#If the file exists, it's read and the db_fields object is created by it.
#If it doesn't, then it's created with every field included, with potentially IDs and FKs excluded, depending on what you select
db_fields_path <- paste0(getwd(), "/db_fields.csv")

db_fields <- initialise_return_db_fields(csv_path = db_fields_path,
                                         Driver = "{SQL Server};",
                                         Database = "DB_Name",
                                         Server = "123.456.78.9",
                                         UID = NULL,
                                         PWD = NULL,
                                         Trusted_Connection = TRUE,
                                         Port = 1433,
                                         ForceCreate_csv = FALSE, #If TRUE then even if the db_fields exist, it will be deleted and overwriten by a newly created default db_fields
                                         ExcludeIdentities = FALSE,
                                         ExcludeForeignKeys = TRUE,
                                         Update_DBEnv_DBFields = TRUE, #If TRUE and creating db_fields file, you can have an internal main db_fields accessible via db$db_fields usually acting as the main db_fields. Default is FALSE as having local db_fields variables is the default behaviour.
                                         ExcludeAuditingFields = FALSE, #If TRUE and creating db_fields file, any SQL Columns ending with "_OrigEntryOn", "_OrigEntryBy", "_EntryOn", "_EntryBy", "_CompName", "_Remote" or "_Username" will have INCLUDE == FALSE by default
                                         ExcludeSYSDIAGRAMS = TRUE, #If TRUE and creating db_fields file, any SQL Columns on the table "sysdiagrams" will have INCLUDE == FALSE by default
                                         RegexToSelectTables = "^(DIM_|FACT_|TBL_)"#, #A regex that will get tables which you want to be deselected by default. As it is it matches all SQL Tables whose name begins with DIM_, FACT_, or TBL_
                                         #Table1$T1_ForeignKey1 == Table2$T2_ID, #Please notice that Foreign Keys are always
                                         #Table2$T2_ForeignKey1 == Table3$T3_ID, #on the left hand side, whilst IDs are always
                                         #Table4$T4_ForeignKey1 == Table2$T2_ID, #on the right hand side.
                                         #Table4$T4_ForeignKey2 == Table1$T1_ID
#If you haven't set the SQL Relationships on the Database, you can impose them here by
#uncommenting the lines above and replacing the text with the actual tables and columns

If the db_fields .csv file doesn't exist and it's created or if you set ForceCreate_csv = TRUE, then you may also initialise the db_fields with IDs and FKs selected or excluded using ExcludeIdentities = FALSE and ExcludeForeignKeys = TRUE.

Select which SQL Columns you need

db_fields <- edit_db_fields(db_fields, Update_DBEnv_DBFields = TRUE) #What the user selected is now saved on the db_fields variable (AND on db$db_fields because Update_DBEnv_DBFields == TRUE).
write_db_fields_csv(db_fields, db_fields_path) #For any run on the current session, the user preferences are assumed, but we need to save the file for future runs.

Useage

Depending on what you want to achieve, there are different levels of joining that you might want to do.

Please note that Initialisation has to have occurred before any code below can be used.

Getting the main_joint_tables

main_joint_tables <-
  create_main_joint_tables(db_fields = db_fields,
                           db_forced_rel = NULL, #We don't want to Force any relationships to create a 1-JointTable, so db_forced_rel is NULL
                           con = db$con,
                           DeselectKeysIfIncludeFalse = TRUE
                           )

New Arguments: * db_fields: A DF with columns: "Include, KeyType, Table, Column, Type, RelationshipWithTable, RelationshipWithColumn, Transformation, Comment" about the User Selected fields and Relationships * db_forced_rel: A Named String Vector. The vector names MUST point to the main table to be used for the 1-Joint-Table as its LHS * con: A dbConnect {DBI} connection object to a SQL Database * DeselectKeysIfIncludeFalse: A Boolean. Must be FALSE if we need to continue to 1-Joint-Table, otherwise needed Identity and Foreign keys might be missing * Verbose: A Boolean. Verbose = TRUE will output the consecutive joins as they happen * get_sql_query: A Boolean. get_sql_query = TRUE will create/edit the db$sql_main_joint_tables that output the SQL Code for the tables

Getting the joint_table_Without_extended_joins

From hereinafter we need to have configured the db_forced_rel variable with the forced relationships that we want to impose in order to join the Main tables into 1 table
#Assumptions: Database is in Canonical Form, No two columns have the same name (Usual good practice in Databases)
db_forced_rel <-
  c(                #The LHS of the Relationships MUST be Columns from the main table to be used for the 1-Joint-Table
    Hours_SiteID = "Site_ID",
    Hours_EmployeeID = "Employee_ID"
  )

joint_table_Without_extended_joins <-
  create_joint_table(db_fields = db_fields,
                      db_forced_rel = db_forced_rel)

Getting the extended_main_joint_tables

From hereinafter we need to have configured the db_ColumnsOldNamesToNewNames variable with the renaming schema so that when the same table is joined with different Main tables, the column names change to reflect the different meaning
#DIM_Site will be joined with DIM_Employee, but also with FACT_Hours.
#An employee will work on a certain site each day, which might be different from day to day,
#but the original site he is assigned to will always remain the same - his Main Site.
#DIM_Site holds the Site information, so when it's joined with DIM_Employee, its meaning is the employees Main Site
#However, when it's joined with FACT_Hours, its meaning is the site in which the employee has worked on at that particular day.
#If we are to create a 1-Joint-Table, then the SQL columns cannot have the same name. So we're renaming the columns that
#that come from the DIM_Site table and joined with DIM_Employee into MainSite_[SomeName] instead of Site_[SomeName]
#On the final table (1-Joint-Table) MainSite_ID column will refer to the Site that the employee is assigned to, and Site_ID will refer to the one which he worked that particular day

db_ColumnsOldNamesToNewNames <-
  list(
    DIM_Employee = c(
      c("Site_", "MainSite_")
    )
  )

extended_main_joint_tables <-
  create_extended_main_joint_tables(db_fields = db_fields,
                                    db_forced_rel = db_forced_rel,
                                    db_ColumnsOldNamesToNewNames = db_ColumnsOldNamesToNewNames
                                    )

New Arguments: * db_ColumnsOldNamesToNewNames: A named List. Names correspond to the Table names, and the vectors inside will be used to renamed SQL Columns starting with db_ColumnsOldNamesToNewNames[i][j] to db_ColumnsOldNamesToNewNames[i][j+1] with j going from 1 to length of db_ColumnsOldNamesToNewNames[i] by 2

Getting the joint_table_With_extended_joins

joint_table_With_extended_joins <-
  create_extended_joint_table(db_fields = db_fields,
                              db_forced_rel = db_forced_rel,
                              db_ColumnsOldNamesToNewNames = db_ColumnsOldNamesToNewNames
                              )

And that is all! You've now officially gotten all the different possible levels of joining. Hurray!

Demo

1. Create and populate the Database

Copy the SQL code under the "SQL folder" of the package into SSMS and execute the Query. First execute Database Creation.sql and then execute Data Population.sql. Now you have a "dbautojoinr" SQL Database on your SQL Server with 4 tables (DIM_Employee, DIM_Region, DIM_Site, and FACT_Hours) and with data populated on those tables. The SQL Relationships also already exist on your SQL Database, so you won't need to explicitly impose them on the Initialisation code.

Populated SQL Database

2. Install and Initialise dbautojoinr

if (!("devtools" %in% rownames(installed.packages()))) install.packages("devtools", repos = "https://cloud.r-project.org")
library(devtools)
if (!("dbautojoinr" %in% rownames(installed.packages()))) install_github("N1h1l1sT/dbautojoinr", upgrade = FALSE)
library(dbautojoinr)

db_fields_path <- paste0(getwd(), "/db_fields.csv")

db_fields <- initialise_return_db_fields(csv_path = db_fields_path,
                                         ForceCreate_csv = FALSE,
                                         ExcludeIdentities = FALSE,
                                         ExcludeForeignKeys = TRUE,
                                         Driver = "{SQL Server};",
                                         Database = "dbautojoinr",
                                         Server = "Put your own Server IP/Name here",
                                         UID = NULL,
                                         PWD = NULL,
                                         Trusted_Connection = TRUE,
                                         Port = 1433
)


show_ER_diagramme(db$dm_f) #Shows the SQL Database ER Diagramme

dbautojoinr ER Diagramme

3. Select only the SQL Columns that you want to view

db_fields <- edit_db_fields(db_fields)
write_db_fields_csv(db_fields, db_fields_path)

Feature Selection

4. Set needed parameters

db_forced_rel <-
  c(
    Hours_SiteID = "Site_ID",
    Hours_EmployeeID = "Employee_ID"
  )
db_ColumnsOldNamesToNewNames <-
  list(
    DIM_Employee = c(
      c("Site_", "MainSite_")
    )
  )

5. Get the Main Tables

main_joint_tables <-
  create_main_joint_tables(db_fields,
                        db_forced_rel,
                        db$con,
                        DeselectKeysIfIncludeFalse = TRUE, #No need to make any other joins, so let's only get what the User selected
                        Verbose = TRUE,
                        get_sql_query = FALSE
                        )

6. Get the Extended Main Tables

extended_main_joint_tables <-
  create_main_joint_tables(db_fields,
                        db_forced_rel,
                        db$con,
                        DeselectKeysIfIncludeFalse = FALSE,
                        Verbose = TRUE,
                        get_sql_query = FALSE
                        ) %>%
  zinternal_CreateExtendedMainJointTables(db_fields,
                                db_forced_rel,
                                db_ColumnsOldNamesToNewNames,
                                db$con,
                                DeselectKeysIfIncludeFalse = TRUE,
                                Verbose = TRUE,
                                get_sql_query = FALSE
                                )

You've probably noticed that instead of using create_extended_main_joint_tables to get the result, we're now using 2 different functions that each performs 1 step. The 1st one (create_main_joint_tables) will retrieve the Main Tables, whilst the 2nd (zinternal_CreateExtendedMainJointTables) does the extended joins (in our case, it joins main_joint_tables[[DIM_Employee]].[MainSite_ID] with [DIM_Site].[Site_ID] ) Now, you might want to proceed with this long way if you want to also make custom transformations to some table before the next joining level occurs. Be careful with DeselectKeysIfIncludeFalse which must always be FALSE prior to the last level and always TRUE at the last one.

7. Get the 1-Joint-Table (No renaming or extended joins)

joint_table_Without_extended_joins <-
  create_main_joint_tables(db_fields,
                        db_forced_rel,
                        db$con,
                        DeselectKeysIfIncludeFalse = FALSE,
                        Verbose = TRUE,
                        get_sql_query = FALSE
                        ) %>%
  zinternal_CreateOneJointTable(db_fields,
                      db_forced_rel,
                      db$con,
                      Verbose = TRUE,
                      get_sql_query = FALSE
                      )

8. Get the 1-Joint-Table with Extended Joins

joint_table_With_extended_joins <-
  create_main_joint_tables(db_fields,
                        db_forced_rel,
                        db$con,
                        DeselectKeysIfIncludeFalse = FALSE,
                        Verbose = TRUE,
                        get_sql_query = FALSE
                        ) %>%
  zinternal_CreateExtendedMainJointTables(db_fields,
                                db_forced_rel,
                                db_ColumnsOldNamesToNewNames,
                                db$con,
                                DeselectKeysIfIncludeFalse = FALSE,
                                Verbose = TRUE,
                                get_sql_query = FALSE
                                ) %>%
  zinternal_CreateOneJointTable(db_fields,
                      db_forced_rel,
                      db$con,
                      Verbose = TRUE,
                      get_sql_query = FALSE
                      )

9. Looking at the results

sapply(names(main_joint_tables), function(x) NCOL(main_joint_tables[[x]]))
# Number of Columns on the tables of main_joint_tables
#    DIM_Site DIM_Employee   FACT_Hours 
#           2            4            6

sapply(names(extended_main_joint_tables), function(x) NCOL(extended_main_joint_tables[[x]]))
# Number of Columns on the tables of extended_main_joint_tables
#    DIM_Site DIM_Employee   FACT_Hours 
#           2            5            6 

print(c(joint_table = NCOL(joint_table_Without_extended_joins), joint_table_extended = NCOL(joint_table_With_extended_joins)))
# Number of Columns on the 2 Fully Joint Tables, withand without extended joins
#    joint_table joint_table_extended 
#           7            9 

End of Demo



N1h1l1sT/dbautojoinr documentation built on Jan. 24, 2021, 10:15 a.m.