create.DN: Create Data Nuggets


View source: R/createDN.R

Description

This function draws a random sample of observations from a large dataset and creates data nuggets, a type of representative sample of the dataset, using a specified distance metric.

Usage

create.DN(x,
          RS.num = 2.5*(10^5),
          DN.num1 = 10^4,
          DN.num2 = 2000,
          dist.metric = "euclidean",
          seed = 291102,
          no.cores = (detectCores() - 1),
          make.pbs = TRUE)

Arguments

x

A data matrix (of class matrix, data.frame, or data.table) containing only entries of class numeric.

RS.num

The number of observations to sample from the data matrix. Must be of class numeric.

DN.num1

The number of initial data nugget centers to create. Must be of class numeric.

DN.num2

The number of data nuggets to create. Must be of class numeric.

dist.metric

The distance metric used to create the initial centers of data nuggets. Must be 'euclidean' or 'manhattan'.

seed

Random seed for replication. Must be of class numeric.

no.cores

Number of cores used for parallel processing. If 0, parallel processing is not used. The default, detectCores() - 1, relies on detectCores() from the parallel package. Must be of class numeric.

make.pbs

Whether to print progress bars. Must be TRUE or FALSE.
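
The two values accepted by dist.metric name the standard Euclidean and Manhattan distances. As a minimal illustration of how they differ, the sketch below uses base R's dist() on two points; it only demonstrates the metrics and is not meant to show how create.DN computes distances internally.

```r
# Two points in the plane: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))

# Euclidean distance: sqrt(3^2 + 4^2)
d.euclidean <- as.numeric(dist(pts, method = "euclidean"))

# Manhattan distance: |3| + |4|
d.manhattan <- as.numeric(dist(pts, method = "manhattan"))

d.euclidean  # 5
d.manhattan  # 7
```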

Details

Data nuggets are a representative sample meant to summarize Big Data: they reduce a large dataset to a much smaller one by eliminating redundant points while preserving the peripheries of the dataset. Each data nugget is defined by a center (location), a weight (importance), and a scale (internal variability). This function creates data nuggets using Algorithm 1 of the paper listed under References.
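
To make the center/weight/scale description concrete, here is a simplified base R sketch. It is not Algorithm 1 and not the package's implementation: it picks a few observations at random as centers, assigns every observation to its nearest center, and summarizes each group. The column names and the definition of scale used here (mean of the per-column variances) are assumptions made for illustration only.

```r
set.seed(1)

# Toy data: 200 observations in 2 dimensions
x <- matrix(rnorm(400), ncol = 2)

# Simplification: use 5 randomly chosen observations as initial centers
n.nuggets <- 5
initial.centers <- x[sample(nrow(x), n.nuggets), , drop = FALSE]

# Euclidean distances from every observation (rows) to every center (columns)
d <- as.matrix(dist(rbind(initial.centers, x)))[-(1:n.nuggets), 1:n.nuggets]

# Assign each observation to its nearest center
assignment <- unname(apply(d, 1, which.min))

# Summarize each group by a center, a weight, and a scale
nuggets <- do.call(rbind, lapply(seq_len(n.nuggets), function(k) {
  members <- x[assignment == k, , drop = FALSE]
  data.frame(
    index   = k,
    center1 = mean(members[, 1]),
    center2 = mean(members[, 2]),
    # weight: number of observations represented by this group
    weight  = nrow(members),
    # assumed scale: mean per-column variance (0 for a single member)
    scale   = if (nrow(members) > 1) mean(apply(members, 2, var)) else 0
  )
}))

nuggets
```

The resulting data frame has one row per group and ncol(x) + 3 columns (index, one center coordinate per column of x, weight, scale), and the weights sum to the number of observations.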

Value

An object of class datanugget:

Data Nuggets

DN.num2 by (ncol(x)+3) data frame containing the information for the data nuggets created: an index, a center (one column per column of x), a weight, and a scale.

Data Nugget Assignments

Vector of length nrow(x) containing the data nugget assignment of each observation in x.
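
Because the component names contain spaces, they need backticks (or [[ ]]) when accessed. The sketch below uses a hand-built mock list rather than a real create.DN result; the column names and values in it are made up for illustration and are not taken from the package.

```r
# Mock stand-in for a returned object (values are invented for illustration)
mock.DN <- list(
  `Data Nuggets` = data.frame(index   = 1:2,
                              center1 = c(-1, 1),
                              weight  = c(3, 2),
                              scale   = c(0.5, 0.2)),
  `Data Nugget Assignments` = c(1, 1, 2, 1, 2)
)

# Names with spaces require backticks with $ ...
nuggets <- mock.DN$`Data Nuggets`

# ... or a quoted string with [[ ]]
assignments <- mock.DN[["Data Nugget Assignments"]]

# Count how many observations were assigned to each nugget in the mock
counts <- table(assignments)
counts
```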

Author(s)

Traymon Beavers, Javier Cabrera, Mariusz Lubomirski

References

Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure (Submitted for Publication, 2019)

Examples

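      library(datanugget)

      ## small example: 10^4 observations, 3 variables, no parallel processing
      X = cbind.data.frame(rnorm(10^4),
                           rnorm(10^4),
                           rnorm(10^4))

      suppressMessages({

        my.DN = create.DN(x = X,
                          RS.num = 10^3,
                          DN.num1 = 500,
                          DN.num2 = 250,
                          no.cores = 0,
                          make.pbs = FALSE)

      })

      # data frame of data nuggets (index, center, weight, scale)
      my.DN$`Data Nuggets`
      # data nugget assignment of each observation
      my.DN$`Data Nugget Assignments`

      ## large example: 10^6 observations, 5 variables,
      ## default no.cores (parallel processing) and progress bars
      X = cbind.data.frame(rnorm(10^6),
                           rnorm(10^6),
                           rnorm(10^6),
                           rnorm(10^6),
                           rnorm(10^6))

      my.DN = create.DN(x = X,
                        RS.num = 10^5,
                        DN.num1 = 10^4,
                        DN.num2 = 2000)

      my.DN$`Data Nuggets`
      my.DN$`Data Nugget Assignments`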

datanugget documentation built on Jan. 25, 2020, 1:07 a.m.