madlib.summary: Data summary function
In PivotalR: A Fast, Easy-to-Use Tool for Manipulating Tables in Databases and a Wrapper of MADlib

Description Usage Arguments Value Author(s) See Also Examples

‘summary’ is a generic function used to produce summary statistics of any data table. The function invokes particular methods from the MADlib library to provide an overview of the data. The computation is parallelized by MADlib if the connected database is a Greenplum database.

madlib.summary(x, target.cols = NULL, grouping.cols = NULL,
               get.distinct = TRUE, get.quartiles = TRUE,
               ntile = NULL, n.mfv = 10, estimate = TRUE,
               interactive = FALSE)

## S4 method for signature 'db.obj'
summary(object, target.cols = NULL, grouping.cols = NULL,
               get.distinct = TRUE, get.quartiles = TRUE,
               ntile = NULL, n.mfv = 10, estimate = TRUE,
               interactive = FALSE)

`x,object`	An object of `db.obj` class. Currently, this parameter is mandatory. If it is an object of class `db.Rquery` or `db.view`, a temporary table will be created, and further computation will be done on the temporary table. After the computation, the temporary table will be dropped from the corresponding database.
`target.cols`	Vector of string. Default value is NULL. Column names in the table for which the summary is desired. When NULL, summaries of all columns are returned.
`grouping.cols`	List of string. Default value is NULL. Column names in the table by which to group the data. When NULL, no grouping of data is performed.
`get.distinct`	Logical. Default value is TRUE. Are distinct values required in the summary?
`get.quartiles`	Logical. Default value is TRUE. Are quartile values required in the summary?
`ntile`	Vector of floats. Default value is NULL. Vector of quantiles required as part of the summary.
`n.mfv`	Integer. Default value is 10. How many ‘most-frequent-values’ (MFVs) to compute?
`estimate`	Logical. Default value is TRUE. Should an estimated computation be used to compute values for distincts and MFVs (as opposed to an exact but slow method)?
`interactive`	Logical. Default is FALSE. If `x` is of type `db.view`, then extracting data from it actually computes the view, which might take a long time, especially for large data sets. When `interactive` is TRUE, this function asks the user whether to continue extracting data from the view.
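As a rough local analogue of what `get.quartiles`, `ntile` and `n.mfv` request, the sketch below computes the same kinds of quantities in base R on a small made-up vector. It does not call MADlib and is not how MADlib computes them: the in-database quantile method may differ from R's default interpolation, and with `estimate = TRUE` the distinct counts and MFVs are estimates rather than exact values.

```r
# Local base-R analogue of get.quartiles, ntile and n.mfv.
# Illustration only: this does not reproduce MADlib's in-database algorithms.
rings <- c(15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 9, 7, 9)  # made-up sample values

# get.quartiles = TRUE: first quartile, median and third quartile
quartiles <- quantile(rings, probs = c(0.25, 0.5, 0.75), names = FALSE)

# ntile = c(0.1, 0.6): extra quantiles at the requested probabilities
ntiles <- quantile(rings, probs = c(0.1, 0.6), names = FALSE)

# n.mfv = 2: the two most frequent values together with their counts
counts <- sort(table(rings), decreasing = TRUE)
mfv <- head(counts, 2)

print(quartiles)
print(ntiles)
print(mfv)
```

In the real function these quantities come back in the `first_quartile`, `median`, `third_quartile`, `quantile_array`, `most_frequent_values` and `mfv_frequencies` columns of the result.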

A data.frame object. Each column in the table (or target.cols) is a row in the result data frame. Each column of the data frame is described below:

`group_by`	character. Group-by column names (NA if none provided)
`group_by_value`	character. Values of the group-by columns (NA if no grouping)
`target_column`	character. Targeted column values for which summary is requested
`column_number`	integer. Physical column number for the target column in the database
`data_type`	character. Data type of target column. Standard database descriptors will be displayed
`row_count`	numeric. Number of rows for the target column
`distinct_values`	numeric. Number of distinct values in the target column
`missing_values`	numeric. Number of missing values in the target column
`blank_values`	numeric. Number of blank values (blanks are defined as values with only whitespace)
`fraction_missing`	numeric. Fraction of total rows that are missing, expressed as a decimal (e.g. 0.3)
`fraction_blank`	numeric. Fraction of total rows that are blank, expressed as a decimal (e.g. 0.3)
`mean`	numeric. Mean value of target column (if target is numeric, else NA)
`variance`	numeric. Variance of target column (if target is numeric, else NA)
`min`	numeric. Min value of target column (for strings this is the length of the shortest string)
`max`	numeric. Max value of target column (for strings this is the length of the longest string)
`first_quartile`	numeric. First quartile (25th percentile, valid only for numeric columns)
`median`	numeric. Median value of target column (valid only for numeric columns)
`third_quartile`	numeric. Third quartile (75th percentile, valid only for numeric columns)
`quantile_array`	numeric. Quantile values corresponding to the `ntile` argument (`ntile_array` in MADlib)
`most_frequent_values`	character. Most frequent values
`mfv_frequencies`	character. Frequency of the most frequent values
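The definitions of the missing, blank and string-length columns can be illustrated locally. The following is a minimal base-R sketch on made-up values, not MADlib's code. It assumes that `row_count` includes rows with missing values and that an empty string counts as blank; neither assumption is stated in this page, so check against real output before relying on it.

```r
# Local illustration of row_count, missing_values, blank_values,
# fraction_missing, fraction_blank and string min/max for a text column.
# Illustration only: the values and the edge-case choices are assumptions.
sex <- c("M", "F", NA, "  ", "I", "", "M", NA)  # made-up values

row_count      <- length(sex)           # assumption: NA rows are counted
missing_values <- sum(is.na(sex))

# A blank is a non-missing value that contains only whitespace.
# Assumption: the empty string "" is treated as blank as well.
blank_values <- sum(!is.na(sex) & grepl("^[[:space:]]*$", sex))

fraction_missing <- missing_values / row_count
fraction_blank   <- blank_values / row_count

# For strings, min and max report the shortest and longest string lengths.
lens    <- nchar(sex[!is.na(sex)])
min_len <- min(lens)
max_len <- max(lens)

print(c(row_count = row_count, missing = missing_values, blank = blank_values))
print(c(fraction_missing = fraction_missing, fraction_blank = fraction_blank))
print(c(min = min_len, max = max_len))
```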

The data.frame has an extra attribute named "summary", which is a db.data.frame object and wraps the result table created by MADlib inside the database. One can access this object using attr(res, "summary"), where res is the result of this function.
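Because the result is an ordinary data.frame with one row per summarized column, it can be filtered like any other data frame, and the wrapper is read with `attr()`. The sketch below uses a hand-built stand-in so that it runs without a database: every value in it is invented, and the "summary" attribute is a placeholder string rather than a real db.data.frame.

```r
# Stand-in for the value returned by madlib.summary().
# All values are invented for illustration; a real result has more columns.
res <- data.frame(
  target_column    = c("sex", "length", "rings"),
  row_count        = c(100, 100, 100),
  fraction_missing = c(0.02, 0, 0.4),
  stringsAsFactors = FALSE
)

# Placeholder: in a real session this attribute is a db.data.frame object
# that wraps the result table inside the database.
attr(res, "summary") <- "placeholder for the db.data.frame wrapper"

# Each row describes one column, so ordinary subsetting finds sparse columns.
sparse_cols <- res$target_column[res$fraction_missing > 0.1]
print(sparse_cols)

# Retrieve the wrapper object attached to the result.
wrapper <- attr(res, "summary")
print(wrapper)
```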

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

madlib.lm, madlib.glm, madlib.arima are MADlib wrapper functions.

delete safely deletes the result of this function.

## Not run: 
## get the help for a method
## help("madlib.summary")


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

delete("abalone", conn.id = cid)
as.db.data.frame(abalone, "abalone", conn.id = cid, verbose = FALSE)
x  <- db.data.frame("abalone", conn.id = cid, verbose = FALSE)

lk(x, 10)

# summarize every column of the table with the default settings
summary_result  <- madlib.summary(x)
print(summary_result)

# summarize selected columns, grouped by sex, with custom quantiles and 5 MFVs
summary_result  <- madlib.summary(x, target.cols=c('rings', 'length', 'diameter'),
                                    grouping.cols=c('sex'),
                                    get.distinct=FALSE,
                                    get.quartiles=TRUE,
                                    ntile=c(0.1, 0.6),
                                    n.mfv=5,
                                    estimate=TRUE,
                                    interactive=FALSE)

print(summary_result)

db.disconnect(cid, verbose = FALSE)

## End(Not run)

PivotalR documentation built on March 13, 2021, 1:06 a.m.
