madlib.summary: Data summary function


Description

‘summary’ is a generic function used to produce summary statistics for any data table. It invokes methods from the MADlib library to provide an overview of the data. MADlib parallelizes the computation when the connected database is a Greenplum database.

Usage

madlib.summary(x, target.cols = NULL, grouping.cols = NULL,
               get.distinct = TRUE, get.quartiles = TRUE,
               ntile = NULL, n.mfv = 10, estimate = TRUE,
               interactive = FALSE)

## S4 method for signature 'db.obj'
summary(object, target.cols = NULL, grouping.cols = NULL,
               get.distinct = TRUE, get.quartiles = TRUE,
               ntile = NULL, n.mfv = 10, estimate = TRUE,
               interactive = FALSE)

Arguments

x,object

An object of class db.obj. Currently, this parameter is mandatory. If it is an object of class db.Rquery or db.view, a temporary table is created and all further computation is done on that table. After the computation, the temporary table is dropped from the corresponding database.

target.cols

Vector of strings. Default value is NULL. Names of the columns in the table for which a summary is desired. When NULL, summaries of all columns are returned.

grouping.cols

Vector of strings. Default value is NULL. Names of the columns in the table by which to group the data. When NULL, no grouping is performed.

get.distinct

Logical. Default value is TRUE. Are distinct values required in the summary?

get.quartiles

Logical. Default value is TRUE. Are quartile values required in the summary?

ntile

Numeric vector. Default value is NULL. Quantiles to include in the summary, for example c(0.1, 0.6).

n.mfv

Integer. Default value is 10. How many ‘most-frequent-values’ (MFVs) to compute?

estimate

Logical. Default value is TRUE. Should an estimated computation be used for the distinct-value counts and the MFVs (as opposed to an exact but slower method)?

interactive

Logical. Default value is FALSE. If x is of type db.view, extracting data from it actually computes the view, which might take a long time, especially for large data sets. When interactive is TRUE, this function asks the user whether to continue extracting data from the view.
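As a rough illustration of how get.quartiles and ntile relate, the base-R sketch below computes the three quartiles and two extra quantiles for a small in-memory vector. It does not call MADlib. The data are made up, and R's default interpolation rule in quantile() is an assumption here, so in-database results may differ slightly.

```r
# Made-up sample data; a real call would operate on a database column.
x <- c(7, 9, 10, 8, 15, 11, 9, 12, 10, 14)

# get.quartiles = TRUE asks for the 25th, 50th and 75th percentiles.
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

# ntile requests additional quantiles, here the 10th and 60th percentiles.
ntile <- c(0.1, 0.6)
extra <- quantile(x, probs = ntile, names = FALSE)

print(quartiles)
print(extra)
```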

Value

A data.frame object. Each column of the table (or each column listed in target.cols) corresponds to one row of the result data frame. The columns of the data frame are described below:

group_by

character. Group-by column names (NA if none provided)

group_by_value

character. Values of the group-by columns (NA if no grouping)

target_column

character. Name of the target column for which the summary is requested

column_number

integer. Physical column number for the target column in the database

data_type

character. Data type of target column. Standard database descriptors will be displayed

row_count

numeric. Number of rows for the target column

distinct_values

numeric. Number of distinct values in the target column

missing_values

numeric. Number of missing values in the target column

blank_values

numeric. Number of blank values (blanks are defined as values with only whitespace)

fraction_missing

numeric. Fraction of total rows that are missing, expressed as a decimal (e.g. 0.3)

fraction_blank

numeric. Fraction of total rows that are blank, expressed as a decimal (e.g. 0.3)

mean

numeric. Mean value of target column (if target is numeric, else NA)

variance

numeric. Variance of target column (if target is numeric, else NA)

min

numeric. Min value of target column (for strings this is the length of the shortest string)

max

numeric. Max value of target column (for strings this is the length of the longest string)

first_quartile

numeric. First quartile (25th percentile, valid only for numeric columns)

median

numeric. Median value of target column (valid only for numeric columns)

third_quartile

numeric. Third quartile (75th percentile, valid only for numeric columns)

quantile_array

numeric. Percentile values corresponding to the quantiles given in ntile

most_frequent_values

character. Most frequent values

mfv_frequencies

character. Frequency of the most frequent values
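To make a few of these column definitions concrete, here is a minimal base-R sketch that reproduces them for a single in-memory character vector. It is not MADlib's implementation. The sample values are made up, and treating an empty string as blank is an assumption.

```r
# Made-up values for one text column, including NA and blank entries.
vals <- c("M", "F", "  ", NA, "F", "I", "", "F", "M", NA)

row_count        <- length(vals)
missing_values   <- sum(is.na(vals))
# Assumption: empty strings count as blank, like whitespace-only strings.
blank_values     <- sum(!is.na(vals) & trimws(vals) == "")
fraction_missing <- missing_values / row_count
fraction_blank   <- blank_values / row_count

# For strings, min and max refer to the shortest and longest string lengths.
lens    <- nchar(vals[!is.na(vals)])
min_len <- min(lens)
max_len <- max(lens)

# Exact counts of the most frequent values, keeping the top n.mfv entries.
n.mfv  <- 2
counts <- sort(table(vals[!is.na(vals)]), decreasing = TRUE)
mfv    <- head(counts, n.mfv)

print(c(row_count = row_count, missing = missing_values, blank = blank_values))
print(mfv)
```

The exact table() count plays the role of the slower, non-estimated method; with estimate = TRUE the function documented here uses an approximation instead.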

The data.frame has an extra attribute named "summary", which is a db.data.frame object that wraps the result table created by MADlib inside the database. One can access this object using attr(res, "summary"), where res is the result of this function.
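The sketch below shows one way the returned data frame might be post-processed in plain R. Because it has to run without a database, res is a hand-built mock that only borrows the documented column names. Its values and data type labels are invented, and the "summary" attribute is a placeholder string rather than a real db.data.frame.

```r
# Mock stand-in for a madlib.summary() result; all values are made up.
res <- data.frame(
  target_column = c("sex", "length", "rings"),
  data_type     = c("text", "float8", "int4"),
  mean          = c(NA, 0.52, 9.93),
  variance      = c(NA, 0.0144, 10.24),
  stringsAsFactors = FALSE
)
# Placeholder only: in a real result this attribute is a db.data.frame object.
attr(res, "summary") <- "<db.data.frame wrapping the MADlib result table>"

# Keep the rows for numeric columns and derive a standard deviation.
num <- res[!is.na(res$mean), c("target_column", "mean", "variance")]
num$sd <- sqrt(num$variance)
print(num)

# Retrieve the wrapped object the same way as for a real result.
tbl <- attr(res, "summary")
```

With a live connection, res would come from madlib.summary(x) and tbl would refer to the table stored in the database.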

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

See Also

madlib.lm, madlib.glm, madlib.arima are MADlib wrapper functions.

delete safely deletes the result of this function.

Examples

## Not run: 
## get the help for a method
## help("madlib.summary")


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

delete("abalone", conn.id = cid)
as.db.data.frame(abalone, "abalone", conn.id = cid, verbose = FALSE)
x  <- db.data.frame("abalone", conn.id = cid, verbose = FALSE)

lk(x, 10)

# summary of all columns with the default settings
summary_result  <- madlib.summary(x)
print(summary_result)

# summary of selected columns, grouped by sex, with custom quantiles
summary_result <- madlib.summary(x, target.cols = c('rings', 'length', 'diameter'),
                                 grouping.cols = c('sex'),
                                 get.distinct = FALSE,
                                 get.quartiles = TRUE,
                                 ntile = c(0.1, 0.6),
                                 n.mfv = 5,
                                 estimate = TRUE,
                                 interactive = FALSE)

print(summary_result)

db.disconnect(cid, verbose = FALSE)

## End(Not run)

PivotalR documentation built on March 13, 2021, 1:06 a.m.