diagnose_category.tbl_dbi: Diagnose data quality of categorical variables in the DBMS
In bit2r/kodlookr: Korean Help Resources for the dlookr Package

Description Usage Arguments Details Value Categorical diagnostic information See Also Examples

The diagnose_category() produces information for diagnosing the quality of the character(CHAR, VARCHAR, VARCHAR2, etc.) column of the DBMS table through tbl_dbi.

## S3 method for class 'tbl_dbi'
diagnose_category(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[1],
  in_database = TRUE,
  collect_size = Inf
)

`.data`	a tbl_dbi.
`...`	one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.
`top`	an integer. Specifies the upper top rank to extract. Default is 10.
`type`	a character string specifying how result are extracted. Default is "rank" that extract top n ranks by decreasing frequency. In this case, if there are ties in rank, more rows than the number specified by the top argument are returned. "n" extract top n rows by decreasing frequency. If there are too many rows to be returned because there are too many ties, you can adjust the returned rows appropriately by using "n".
`in_database`	Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. if FALSE, table data is taken in R and operated in-memory.
`collect_size`	a integer. The number of data samples from the DBMS to R. Applies only if in_database = FALSE.

The scope of the diagnosis is the occupancy status of the levels in categorical data. If a certain level of occupancy is close to 100%, then the removal of this variable in the forecast model will have to be considered. Also, if the occupancy of all levels is close to 0%, this variable is likely to be an identifier.

an object of tbl_df.

The information derived from the categorical data diagnosis is as follows.

variables : variable names
levels: level names
N : number of observation
freq : number of observation at the levels
ratio : percentage of observation at the levels
rank : rank of occupancy ratio of levels

See vignette("diagonosis") for an introduction to these concepts.

diagnose_category.data.frame, diagnose.tbl_dbi, diagnose_category.tbl_dbi, diagnose_numeric.tbl_dbi, diagnose_outlier.tbl_dbi.

library(dplyr)

# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy jobchange to the DBMS with a table named TB_JOBCHANGE
copy_to(con_sqlite, jobchange, name = "TB_JOBCHANGE", overwrite = TRUE)

# Using pipes ---------------------------------
# Diagnosis of all categorical variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category()
  
# Positive values select variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(company_type, job_chnge)
  
# Negative values to drop variables, and In-memory mode
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(-company_type, -job_chnge, in_database = FALSE)
  
# Positions values select variables, and In-memory mode and collect size is 200
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(7, in_database = FALSE, collect_size = 200) 
  
# Positions values select variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(-7)
  
# Top rank levels with top argument
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(top = 2)

# Using pipes & dplyr -------------------------
# Extraction of level that is more than 60% of categorical data
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category()  %>%
  filter(ratio >= 60)
  
# Using type argument -------------------------
 dfm <- data.frame(alpabet = c(rep(letters[1:5], times = 5), "c")) 
 
# copy dfm to the DBMS with a table named TB_EXAMPLE
copy_to(con_sqlite, dfm, name = "TB_EXAMPLE", overwrite = TRUE)  
 
# extract rows that less than equal rank 10
# default of top argument is 10
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category()
   
# extract rows that less than equal rank 2
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2, type = "rank")
   
# extract rows that less than equal rank 2
# default of type argument is "rank"
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2)
 
 # extract only 2 rows
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2, type = "n")

# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)