diagnose_category.tbl_dbi: Diagnose data quality of categorical variables in the DBMS

Description Usage Arguments Details Value Categorical diagnostic information See Also Examples

Description

The diagnose_category() produces information for diagnosing the quality of the character(CHAR, VARCHAR, VARCHAR2, etc.) column of the DBMS table through tbl_dbi.

Usage

1
2
3
4
5
6
7
8
9
## S3 method for class 'tbl_dbi'
diagnose_category(
  .data,
  ...,
  top = 10,
  type = c("rank", "n")[1],
  in_database = TRUE,
  collect_size = Inf
)

Arguments

.data

a tbl_dbi.

...

one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, diagnose_category() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

top

an integer. Specifies the upper top rank to extract. Default is 10.

type

a character string specifying how result are extracted. Default is "rank" that extract top n ranks by decreasing frequency. In this case, if there are ties in rank, more rows than the number specified by the top argument are returned. "n" extract top n rows by decreasing frequency. If there are too many rows to be returned because there are too many ties, you can adjust the returned rows appropriately by using "n".

in_database

Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. if FALSE, table data is taken in R and operated in-memory.

collect_size

a integer. The number of data samples from the DBMS to R. Applies only if in_database = FALSE.

Details

The scope of the diagnosis is the occupancy status of the levels in categorical data. If a certain level of occupancy is close to 100%, then the removal of this variable in the forecast model will have to be considered. Also, if the occupancy of all levels is close to 0%, this variable is likely to be an identifier.

Value

an object of tbl_df.

Categorical diagnostic information

The information derived from the categorical data diagnosis is as follows.

See vignette("diagonosis") for an introduction to these concepts.

See Also

diagnose_category.data.frame, diagnose.tbl_dbi, diagnose_category.tbl_dbi, diagnose_numeric.tbl_dbi, diagnose_outlier.tbl_dbi.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
library(dplyr)

# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy jobchange to the DBMS with a table named TB_JOBCHANGE
copy_to(con_sqlite, jobchange, name = "TB_JOBCHANGE", overwrite = TRUE)

# Using pipes ---------------------------------
# Diagnosis of all categorical variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category()
  
# Positive values select variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(company_type, job_chnge)
  
# Negative values to drop variables, and In-memory mode
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(-company_type, -job_chnge, in_database = FALSE)
  
# Positions values select variables, and In-memory mode and collect size is 200
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(7, in_database = FALSE, collect_size = 200) 
  
# Positions values select variables
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(-7)
  
# Top rank levels with top argument
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category(top = 2)

# Using pipes & dplyr -------------------------
# Extraction of level that is more than 60% of categorical data
con_sqlite %>% 
  tbl("TB_JOBCHANGE") %>% 
  diagnose_category()  %>%
  filter(ratio >= 60)
  
# Using type argument -------------------------
 dfm <- data.frame(alpabet = c(rep(letters[1:5], times = 5), "c")) 
 
# copy dfm to the DBMS with a table named TB_EXAMPLE
copy_to(con_sqlite, dfm, name = "TB_EXAMPLE", overwrite = TRUE)  
 
# extract rows that less than equal rank 10
# default of top argument is 10
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category()
   
# extract rows that less than equal rank 2
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2, type = "rank")
   
# extract rows that less than equal rank 2
# default of type argument is "rank"
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2)
 
 # extract only 2 rows
con_sqlite %>% 
  tbl("TB_EXAMPLE") %>% 
  diagnose_category(top = 2, type = "n")

# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)

bit2r/kodlookr documentation built on Dec. 19, 2021, 9:49 a.m.