‘summary’ is a generic function that produces summary statistics for a data table. It invokes methods from the MADlib library to provide an overview of the data. MADlib parallelizes the computation when the connected database is a Greenplum database.
madlib.summary(x, target.cols = NULL, grouping.cols = NULL,
    get.distinct = TRUE, get.quartiles = TRUE,
    ntile = NULL, n.mfv = 10, estimate = TRUE,
    interactive = FALSE)

## S4 method for signature 'db.obj'
summary(object, target.cols = NULL, grouping.cols = NULL,
    get.distinct = TRUE, get.quartiles = TRUE,
    ntile = NULL, n.mfv = 10, estimate = TRUE,
    interactive = FALSE)
x,object |
An object of |
target.cols |
Vector of strings. Default value is NULL. Column names in the table for which the summary is desired. When NULL, summaries of all columns are returned. |
grouping.cols |
List of strings. Default value is NULL. Column names in the table by which to group the data. When NULL, no grouping is performed. |
get.distinct |
Logical. Default value is TRUE. Are distinct values required in the summary? |
get.quartiles |
Logical. Default value is TRUE. Are quartile values required in the summary? |
ntile |
Vector of floats. Default value is NULL. Quantiles to include in the summary, given as fractions (e.g. c(0.1, 0.6)). |
n.mfv |
Integer. Default value is 10. How many ‘most-frequent-values’ (MFVs) to compute? |
estimate |
Logical. Default value is TRUE. Should the distinct-value counts and MFVs be computed with an estimation method (as opposed to an exact but slow method)? |
interactive |
Logical. Default is FALSE. If |
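To make the ntile, get.quartiles and n.mfv arguments concrete, the sketch below computes comparable quantities on a small in-memory vector with base R. It is only a local illustration: the sample values are made up, and base R's quantile() may use a different interpolation rule than MADlib does inside the database.

```r
# Local, in-memory illustration only; this is not MADlib's implementation.
values <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)  # hypothetical sample column

# ntile = c(0.1, 0.6) asks for the 10th and 60th percentiles.
requested <- quantile(values, probs = c(0.1, 0.6), names = FALSE)

# get.quartiles = TRUE asks for the 25th, 50th and 75th percentiles.
quartiles <- quantile(values, probs = c(0.25, 0.5, 0.75), names = FALSE)

# n.mfv = 2 asks for the two most frequent values. This is an exact count,
# i.e. the slow-but-exact alternative to an estimated computation.
counts <- sort(table(values), decreasing = TRUE)
top2 <- head(counts, 2)

print(requested)
print(quartiles)
print(top2)
```

With estimate = TRUE the database-side frequencies may differ from an exact count like the one above, which is the trade-off the estimate argument describes.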
A data.frame object. Each column in the table (or in target.cols) is a row in the result data frame. Each column of the data frame is described below:
group_by |
character. Group-by column names (NA if none provided) |
group_by_value |
character. Values of the group-by columns (NA if no grouping) |
target_column |
character. Name of the target column for which the summary is requested |
column_number |
integer. Physical column number for the target column in the database |
data_type |
character. Data type of target column. Standard database descriptors will be displayed |
row_count |
numeric. Number of rows for the target column |
distinct_values |
numeric. Number of distinct values in the target column |
missing_values |
numeric. Number of missing values in the target column |
blank_values |
numeric. Number of blank values (blanks are defined as values with only whitespace) |
fraction_missing |
numeric. Fraction of total rows that are missing, expressed as a decimal (e.g. 0.3) |
fraction_blank |
numeric. Fraction of total rows that are blank, expressed as a decimal (e.g. 0.3) |
mean |
numeric. Mean value of target column (if target is numeric, else NA) |
variance |
numeric. Variance of target column (if target is numeric, else NA) |
min |
numeric. Min value of target column (for strings this is the length of the shortest string) |
max |
numeric. Max value of target column (for strings this is the length of the longest string) |
first_quartile |
numeric. First quartile (25th percentile, valid only for numeric columns) |
median |
numeric. Median value of target column (valid only for numeric columns) |
third_quartile |
numeric. Third quartile (75th percentile, valid only for numeric columns) |
quantile_array |
numeric. Percentile values corresponding to the quantiles requested in ntile |
most_frequent_values |
character. Most frequent values |
mfv_frequencies |
character. Frequency of the most frequent values |
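The definitions of the missing, blank, min and max columns can be illustrated on a local vector. The helper below, local_column_summary, is hypothetical and is not part of PivotalR or MADlib; it only mirrors the column descriptions above. Two points are assumptions: that row_count includes missing rows, and that an empty string counts as blank.

```r
# Hypothetical helper: summarise one local vector following the column
# definitions above. Not MADlib's implementation.
local_column_summary <- function(values) {
  is_missing <- is.na(values)
  # A blank is a non-missing string containing only whitespace.
  # Assumption: the empty string is also treated as blank.
  is_blank <- !is_missing & is.character(values) & grepl("^\\s*$", values)
  present <- values[!is_missing]
  # For strings, min and max refer to string lengths, not to the values.
  sizes <- if (is.character(values)) nchar(present) else present
  data.frame(
    row_count        = length(values),   # assumption: counts all rows
    missing_values   = sum(is_missing),
    blank_values     = sum(is_blank),
    fraction_missing = mean(is_missing), # a decimal such as 0.1, not 10
    fraction_blank   = mean(is_blank),
    min              = min(sizes),
    max              = max(sizes)
  )
}

sex <- c("M", "F", " ", NA, "I", "M", "  ", "F", "M", "F")  # made-up column
print(local_column_summary(sex))
```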
The data.frame has an extra attribute named "summary", which is a db.data.frame object that wraps the result table created by MADlib inside the database. One can access this object using attr(res, "summary"), where res is the result of this function.
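The attribute access described above follows the usual base R mechanics. The sketch below uses a hand-built stand-in for res so that it runs without a database; the placeholder string stands where the real db.data.frame object would be, and the column contents are illustrative only.

```r
# Stand-in for the value returned by madlib.summary(), built by hand so the
# snippet runs without a database connection. In a real session the
# attribute holds a db.data.frame, represented here by a placeholder string.
res <- data.frame(target_column = c("rings", "length"),
                  stringsAsFactors = FALSE)
attr(res, "summary") <- "placeholder for the db.data.frame wrapper"

# Retrieve the attached object as described above.
wrapped <- attr(res, "summary")
print(wrapped)

# The attribute is stored alongside the data frame, not as one of its columns.
print(names(res))
```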
Author: Predictive Analytics Team at Pivotal Inc.
Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io
madlib.lm, madlib.glm, madlib.arima are MADlib wrapper functions.
delete safely deletes the result of this function.
## Not run:
## get the help for a method
## help("madlib.summary")
## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)
delete("abalone", conn.id = cid)
as.db.data.frame(abalone, "abalone", conn.id = cid, verbose = FALSE)
x <- db.data.frame("abalone", conn.id = cid, verbose = FALSE)
lk(x, 10)
# madlib.summary: summarize all columns with the default settings
summary_result <- madlib.summary(x)
print(summary_result)
# madlib.summary: summarize selected columns, grouped by sex
summary_result <- madlib.summary(x, target.cols=c('rings', 'length', 'diameter'),
                                 grouping.cols=c('sex'),
                                 get.distinct=FALSE,
                                 get.quartiles=TRUE,
                                 ntile=c(0.1, 0.6),
                                 n.mfv=5,
                                 estimate=TRUE,
                                 interactive=FALSE)
print(summary_result)
db.disconnect(cid, verbose = FALSE)
## End(Not run)