Utility functions for least squares estimation in large data sets.

1 2 3 4 |

`B` |
a squared matrix. |

`symmetric` |
logical, is |

`tol.values` |
tolerance to be consider eigenvalues equals to zero. |

`tol.vectors` |
tolerance to be consider eigenvectors equals to zero. |

`out.B` |
Have the matrix B to be returned? |

`method` |
the method to check for singularity. By default is "eigen", and an eigendecomposition of X'X is made. The "Cholesky" method is faster than "eigen" and does not use tolerance, but the former seems to be more stable for opportune tolerance values. |

`X` |
the model matrix. |

`w` |
a weights vector. |

`sparse` |
logical, is |

`sparselim` |
a real in the interval [0; 1]. It indicates the minimal proportion of zeroes in the data matrix X in order to consider X as sparse |

eigendec Logical. Do you want to investigate on rank of X? You may set to

`row.chunk` |
an integer which indicates the total rows number
compounding each of the first g-1 blocks. If |

`camp` |
the sample proportion of elements of X on which the survey will be based. |

Function `control`

makes an eigendecomposition of B according established values of tolerance.
Function `cp`

makes the cross-product X'X by partitioning X in row-blocks.
When an optimized BLAS, such as ATLAS, is not installed, the function represents an attempt
to speed up the calculation and avoid overflows with medium-large data sets loaded in R memory.
The results depending on processor type. Good results are obtained, for example, with an AMD Athlon
dual core 1.5 Gb RAM by setting `row.chunk`

to some value less than 1000. Try the example below
by changing the matrix size and the value of `row.chunk`

. If the matrix X is sparse, it will have
class "dgCMatrix" (the package Matrix is required) and the cross-product will be made without
partitioning. However, good performances are usually obtained with a very
high zeroes proportion.
Function `is.sparse`

makes a quick sample survey on sample proportion of zeroes in X.

for the function `control`

, a list with the following elements:

`XTX` |
the matrix product B without singularities (if there are). |

`rank` |
the rank of B |

`pivot` |
an ordered set of column indeces of B with, if the case, the last |

for the function `cp`

:

`new.B` |
the matrix product X'X (weighted, if |

for the function `is.sparse`

:

`sparse` |
a logical value which indicates if the sample proportion of zeroes is
greater than |

Marco ENEA

eigen, chol, qr, crossprod

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 | ```
#### example 1.
n <- 100000
k <- 100
x <- round(matrix(rnorm(n*k),n,k),digits=4)
y <- rnorm(n)
# if an optimized BLAS is not installed, depending on processor type, cp() may be
# faster than crossprod() for large matrices.
system.time(a1 <- crossprod(x))
system.time(a2 <- cp(x,,row.chunk = 500))
all.equal(a1, a2)
#### example 2.1.
n <- 100000
k <- 10
x <- matrix(rnorm(n*k),n,k)
x[,2] <- x[,1] + 2*x[,3] # x has rank 9
y <- rnorm(n)
# estimation by least squares
A <- function(){
A1 <- control(crossprod(x))
ok <- A1$pivot[1:A1$rank]
as.vector(solve(A1$XTX,crossprod(x[,ok],y)))
}
# estimation by QR decomposition
B <- function(){
B1 <- qr(x)
qr.solve(x[,B1$pivot[1:B1$rank]],y)
}
system.time(a <- A())
system.time(b <- B())
all.equal(a,b)
### example 2.2
x <- matrix(c(1:5, (1:5)^2), 5, 2)
x <- cbind(x, x[, 1] + 3*x[, 2])
m <- crossprod(x)
qr(m)$rank # is 2, as it should be
control(m,method="eigen")$rank # is 2, as it should be
control(m,method="Cholesky")$rank # is wrong
### example 3.
n <- 10000
fat1 <- gl(20,500)
y <- rnorm(n)
da <- data.frame(y,fat1)
m <- model.matrix(y ~ factor(fat1),data = da)
is.sparse(m)
``` |

Questions? Problems? Suggestions? Tweet to @rdrrHQ or email at ian@mutexlabs.com.

Please suggest features or report bugs with the GitHub issue tracker.

All documentation is copyright its authors; we didn't write any of that.