Description Usage Arguments Details Value Categorical diagnostic information See Also Examples
The diagnose_category() produces information for diagnosing the quality of the character(CHAR, VARCHAR, VARCHAR2, etc.) column of the DBMS table through tbl_dbi.
1 2 3 4 5 6 7 8 9 |
.data |
a tbl_dbi. |
... |
one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing. |
top |
an integer. Specifies the upper top rank to extract. Default is 10. |
type |
a character string specifying how result are extracted. Default is "rank" that extract top n ranks by decreasing frequency. In this case, if there are ties in rank, more rows than the number specified by the top argument are returned. "n" extract top n rows by decreasing frequency. If there are too many rows to be returned because there are too many ties, you can adjust the returned rows appropriately by using "n". |
in_database |
Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. if FALSE, table data is taken in R and operated in-memory. |
collect_size |
a integer. The number of data samples from the DBMS to R. Applies only if in_database = FALSE. |
The scope of the diagnosis is the occupancy status of the levels in categorical data. If a certain level of occupancy is close to 100%, then the removal of this variable in the forecast model will have to be considered. Also, if the occupancy of all levels is close to 0%, this variable is likely to be an identifier.
an object of tbl_df.
The information derived from the categorical data diagnosis is as follows.
variables : variable names
levels: level names
N : number of observation
freq : number of observation at the levels
ratio : percentage of observation at the levels
rank : rank of occupancy ratio of levels
See vignette("diagonosis") for an introduction to these concepts.
diagnose_category.data.frame
, diagnose.tbl_dbi
,
diagnose_category.tbl_dbi
, diagnose_numeric.tbl_dbi
,
diagnose_outlier.tbl_dbi
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | library(dplyr)
# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# copy jobchange to the DBMS with a table named TB_JOBCHANGE
copy_to(con_sqlite, jobchange, name = "TB_JOBCHANGE", overwrite = TRUE)
# Using pipes ---------------------------------
# Diagnosis of all categorical variables
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category()
# Positive values select variables
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category(company_type, job_chnge)
# Negative values to drop variables, and In-memory mode
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category(-company_type, -job_chnge, in_database = FALSE)
# Positions values select variables, and In-memory mode and collect size is 200
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category(7, in_database = FALSE, collect_size = 200)
# Positions values select variables
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category(-7)
# Top rank levels with top argument
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category(top = 2)
# Using pipes & dplyr -------------------------
# Extraction of level that is more than 60% of categorical data
con_sqlite %>%
tbl("TB_JOBCHANGE") %>%
diagnose_category() %>%
filter(ratio >= 60)
# Using type argument -------------------------
dfm <- data.frame(alpabet = c(rep(letters[1:5], times = 5), "c"))
# copy dfm to the DBMS with a table named TB_EXAMPLE
copy_to(con_sqlite, dfm, name = "TB_EXAMPLE", overwrite = TRUE)
# extract rows that less than equal rank 10
# default of top argument is 10
con_sqlite %>%
tbl("TB_EXAMPLE") %>%
diagnose_category()
# extract rows that less than equal rank 2
con_sqlite %>%
tbl("TB_EXAMPLE") %>%
diagnose_category(top = 2, type = "rank")
# extract rows that less than equal rank 2
# default of type argument is "rank"
con_sqlite %>%
tbl("TB_EXAMPLE") %>%
diagnose_category(top = 2)
# extract only 2 rows
con_sqlite %>%
tbl("TB_EXAMPLE") %>%
diagnose_category(top = 2, type = "n")
# Disconnect DBMS
DBI::dbDisconnect(con_sqlite)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.