Load Balancing a Dataset

Description

These functions rearrange data across all processors so that each processor holds a nearly equal amount of data.

Usage

balance.info(X.gbd, comm = .pbd_env$SPMD.CT$comm,
  gbd.major = .pbd_env$gbd.major, method = .pbd_env$divide.method[1])

load.balance(X.gbd, bal.info = NULL, comm = .pbd_env$SPMD.CT$comm,
  gbd.major = .pbd_env$gbd.major)

unload.balance(new.X.gbd, bal.info, comm = .pbd_env$SPMD.CT$comm)

Arguments

X.gbd

a GBD data matrix (a vector is converted to a one-column matrix).

comm

a communicator number.

gbd.major

1 for row-major storage, 2 for column-major.

method

"block.cyclic" or "block0".

bal.info

an object returned by balance.info.

new.X.gbd

a GBD data matrix or vector.

Details

X.gbd is the data matrix with dimension N.gbd * p and exists on all processors, where N.gbd may vary across processors. If X.gbd is a vector, it is converted to an N.gbd * 1 matrix.

balance.info provides the information needed to balance the data set so that all processors own a similar amount of data. This information may also be useful for tracking where the data go to or come from.

load.balance transfers data from processors with more data to processors with less data, based on the balance information returned by balance.info.

unload.balance is the inverse of load.balance: it takes the same bal.info and restores the balanced result to its original distribution and order. new.X.gbd is usually the output of load.balance(X.gbd) or the result of further computation on it. Again, if new.X.gbd is a vector, it is converted to a one-column matrix.
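
The round trip described above can be illustrated without MPI. The following is a minimal base-R sketch, not the package's implementation: the per-rank blocks are simulated as a list of matrices, and redistribute() is a hypothetical helper that assumes rows keep their global order when they move between ranks.

```r
set.seed(1)
N.allgbd <- c(0, 10, 20, 30)   # simulated row counts on ranks 0..3
p <- 3

# One matrix per simulated rank (rank 0 owns no rows).
X.list <- lapply(N.allgbd, function(n) matrix(rnorm(n * p), nrow = n, ncol = p))

# Hypothetical helper, not part of the package: stack all blocks in rank
# order, then cut the stacked rows into blocks of the requested sizes.
redistribute <- function(blocks, sizes) {
  all.rows <- do.call(rbind, blocks)
  owner <- rep(seq_along(sizes), times = sizes)
  lapply(seq_along(sizes), function(i) all.rows[owner == i, , drop = FALSE])
}

# "Balance": every simulated rank receives the same number of rows.
n <- sum(N.allgbd) / length(N.allgbd)
balanced <- redistribute(X.list, rep(n, length(N.allgbd)))

# "Unbalance": reuse the original sizes to undo the redistribution.
restored <- redistribute(balanced, N.allgbd)

print(sapply(balanced, nrow))
print(sapply(restored, nrow))
```

Because both directions only need the row counts before and after balancing, the same piece of information serves for both calls, which mirrors how bal.info is passed to load.balance and unload.balance.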

Value

balance.info returns a list containing two data frames and two vectors.

The two data frames are send and recv, which describe the data to be sent and received. Each data frame has two columns, org and belong, giving the rank where each row originally resides and the rank it will belong to after balancing. The number of rows of send should equal N.gbd, and the number of rows of recv should be nearly equal to n = N / COMM.SIZE, where N is the total number of observations across all processors.
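
To make the shape of send and recv concrete, here is a base-R sketch that builds comparable tables by hand for one rank. It is an illustration under stated assumptions, not the package's code: the row counts are those of the example in this page run on 4 processors, the target is taken to be equal blocks, rows are assumed to keep their global order, and plan and my.rank are names introduced only for this sketch.

```r
N.allgbd     <- c(0, 10, 20, 30)    # rows per rank before balancing
new.N.allgbd <- c(15, 15, 15, 15)   # assumed target: equal blocks
ranks <- seq_along(N.allgbd) - 1    # MPI ranks start at 0

# One entry per global row: the rank it starts on (org) and the rank it
# ends up on (belong), assuming rows keep their global order.
plan <- data.frame(
  org    = rep(ranks, times = N.allgbd),
  belong = rep(ranks, times = new.N.allgbd)
)

my.rank <- 3
send <- plan[plan$org == my.rank, ]      # rows this rank currently owns
recv <- plan[plan$belong == my.rank, ]   # rows this rank will own

print(nrow(send))            # equals N.gbd on this rank
print(nrow(recv))            # equals the balanced block size
print(table(send$belong))    # 15 rows go to rank 2, 15 stay on rank 3
```

Reading send by its belong column shows where each local row is headed, and reading recv by its org column shows where each incoming row comes from.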

The two vectors are N.allgbd and new.N.allgbd, which hold the number of rows of X.gbd on every process before and after load balancing, respectively. Both have length equal to comm.size(comm).
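
As a small worked example of how the two vectors relate, the sketch below computes one possible even split in base R. even.split() is a hypothetical helper that gives any remainder to the lowest ranks; the rule the package actually applies depends on the method argument and may differ.

```r
# Hypothetical helper (not part of the package): split the total number
# of rows over all ranks as evenly as possible, remainder to low ranks.
even.split <- function(N.allgbd) {
  comm.size <- length(N.allgbd)
  N <- sum(N.allgbd)
  base <- N %/% comm.size    # rows every rank gets
  extra <- N %% comm.size    # leftover rows to hand out one by one
  base + as.integer(seq_len(comm.size) <= extra)
}

N.allgbd <- c(0, 10, 20, 30)           # rows on ranks 0..3 before balancing
new.N.allgbd <- even.split(N.allgbd)   # rows on ranks 0..3 after balancing
print(new.N.allgbd)

# The total number of rows is unchanged by balancing.
print(sum(N.allgbd) == sum(new.N.allgbd))
```

With 60 rows on 4 ranks there is no remainder, so every entry of new.N.allgbd is n = 60 / 4 = 15.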

load.balance returns a matrix on each processor with dimension nearly equal to n * p.

unload.balance returns a matrix with the same number of rows as the original X.gbd.

Warning(s)

These functions only support objects whose total length is less than 2^32 - 1 on machines using 32-bit integers.

Examples

## Not run: 
# Save code in a file "demo.r" and run with 4 processors by
# > mpiexec -np 4 Rscript demo.r

### Set up the environment.
library(pbdDEMO, quietly = TRUE)

### Generate example data (rank 0 owns no rows).
N.gbd <- 5 * (comm.rank() * 2)
X.gbd <- rnorm(N.gbd * 3)
dim(X.gbd) <- c(N.gbd, 3)
comm.cat("X.gbd[1:5,]\n", quiet = TRUE)
comm.print(X.gbd[1:5,], rank.print = 1, quiet = TRUE)

bal.info <- balance.info(X.gbd)
new.X.gbd <- load.balance(X.gbd, bal.info)
org.X.gbd <- unload.balance(new.X.gbd, bal.info)

comm.cat("org.X.gbd[1:5,]\n", quiet = TRUE)
comm.print(org.X.gbd[1:5,], rank.print = 1, quiet = TRUE)
if(any(org.X.gbd - X.gbd != 0)){
  cat("Unbalance fails in the rank ", comm.rank(), "\n")
}

### Quit.
finalize()

## End(Not run)