cpu.pca: CPU usage metrics for distributed PCA algorithm


Description

Consumption metrics gathered during an execution of the distributed machine learning algorithm Principal Component Analysis (PCA) on an eight-node cluster, using the Spark framework.

Usage

data(cpu.pca)

Format

A data frame containing 938 observations and four dimensions:

  1. user: CPU usage by the algorithm

  2. system: CPU usage by the operating system (OS)

  3. iowait: waiting time for Input/Output (I/O) operations

  4. softirq: CPU time spent by software interrupt requests

All values lie in the range 0 to 100, for every dimension. The dataset contains zero values, but no missing or null values.
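The constraints described above (four named columns, values between 0 and 100, no missing entries) can be checked with a few lines of base R. The following is a minimal sketch only: `sample_metrics` is a hypothetical stand-in with made-up values, not rows from the real dataset, and `check_cpu_metrics` is an illustrative helper that is not part of the gama package. With the package installed, the same function could presumably be applied to `cpu.pca` itself.

```r
# Hypothetical stand-in for a few rows of cpu.pca; the values are made up.
sample_metrics <- data.frame(
  user    = c(0.0, 35.2, 91.7),
  system  = c(0.4, 2.1, 5.3),
  iowait  = c(0.0, 0.0, 1.8),
  softirq = c(0.0, 0.1, 0.2)
)

# Illustrative helper (not from the gama package): returns TRUE only if
# the data frame has the four documented columns, all numeric, with no
# missing values and every value inside [0, 100].
check_cpu_metrics <- function(df) {
  expected    <- c("user", "system", "iowait", "softirq")
  has_columns <- identical(names(df), expected)
  is_numeric  <- all(vapply(df, is.numeric, logical(1)))
  no_missing  <- !any(is.na(df))
  # && short-circuits, so the range test only runs on clean numeric data
  has_columns && is_numeric && no_missing && all(df >= 0 & df <= 100)
}

check_cpu_metrics(sample_metrics)
```

Zero values pass the check, as the description says they occur in the data; only out-of-range or missing entries cause it to return FALSE.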

Note: a Spark cluster of N nodes has one master node and N-1 slave nodes.

Source

The data was measured and collected by the author using the Intel HiBench benchmark framework on an eight-node Spark cluster hosted on the Google Cloud Dataproc engine. Each node had a 16-core CPU and 106 GB of RAM. The Principal Component Analysis algorithm took 11.2 min to run over a synthetically generated dataset totaling 1.68 gigabytes.

References

J. Shlens, A Tutorial on Principal Component Analysis, Epidemiology, vol. 2, no. c, pp. 223-228, 2005.

Jolliffe, I.T.: Principal Component Analysis, Second Edition. Encycl. Stat. Behav. Sci. 30, 487 (2002).

S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, in 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), 2010, pp. 41-51.


gama documentation built on May 2, 2019, 6:45 a.m.