bigPCAToFile: Run a PCA over a set of eBird records


Description

Combine a set of large eBird files and run a PCA over a specified set of columns.

Usage

bigPCAToFile(comparisons, read.wd, write.wd, aux.wd, columns, log.cols,
  scale.center = FALSE, SVD = FALSE, axes)

Arguments

comparisons

List of character vectors. Each character vector lists the species that will be combined before the PCA is run. The names of the character vectors (i.e. the names of the elements of the list) are used to save out summaries of the PCA, e.g. percent variance explained by each axis. This is probably more complicated than it needs to be; see the examples.

read.wd

Path to directory where the eBird records with environmental data are stored.

write.wd

Path to directory where PCA results will be saved.

aux.wd

Path to directory where the summarized PCA results will be saved.

columns

Character vector of column names in the existing eBird records that you will run the PCA over.

log.cols

The names of the columns to log-transform. Somewhat inflexible: a constant of 0.01 is added before taking the natural log (see the sketch after the argument list).

scale.center

Default is FALSE, i.e. a covariance-matrix PCA will be run. To run a correlation-matrix PCA, set to TRUE, as illustrated below. Scaling and centering are accomplished manually before the PCA is run, which I suspect may be slow for really large files. That part is untested, and it may be quicker to do it over a big.matrix; currently the scaling runs over a regular matrix, which is only converted to a big.matrix afterwards.

SVD

Default is FALSE. The big.PCA function supposedly runs much faster if SVD is set to TRUE, but the results are then not exactly the same as a regular prcomp. Specifically, with SVD set to TRUE the standard deviation of each axis is the same, which seems to imply that each axis is equally important; that seems potentially odd to me for later hypervolume calculations (see the sketch after the argument list).

axes

How many axes from the PCA to retain and return.
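
The transformations described above are easier to see in code. The following is a minimal base-R sketch on a toy data frame, not the package's internal bigmemory/bigpca implementation: it shows the log transform applied to columns named in log.cols, the covariance-matrix versus correlation-matrix behaviour controlled by scale.center, and the per-axis standard deviations that the SVD note refers to. All object and column names here are hypothetical.

# toy data standing in for a set of eBird records with environmental columns
set.seed(1)
env <- data.frame(bio1  = rnorm(100, 20, 5),
                  bio12 = rexp(100, 0.01),
                  elev  = runif(100, 0, 3000))

# log.cols: add a constant of 0.01, then take the natural log
env$bio12 <- log(env$bio12 + 0.01)

# scale.center = FALSE (default): covariance-matrix PCA
cov.pca <- prcomp(env, center = TRUE, scale. = FALSE)

# scale.center = TRUE: correlation-matrix PCA (the package scales and
# centers manually; prcomp's scale. argument is the base-R equivalent)
cor.pca <- prcomp(env, center = TRUE, scale. = TRUE)

# per-axis standard deviations; with bigpca's SVD = TRUE these reportedly
# come out equal across axes, unlike the declining values seen here
cov.pca$sdev
cor.pca$sdev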

Details

Depending on the size of the files, it can be nearly impossible to run a standard base-R PCA over the environmental variables in a set of eBird files. This function uses functions from the data.table, bigmemory, and bigpca packages to run the PCA. For small files it is slower than a base-R PCA, but it can theoretically handle very large inputs that would normally crash R. I have a parallel version of this function, but I am not sure it is needed, and it may swamp the RAM; whether a parallel option is worth including will be assessed at some point. Note a major assumption built into the 'comparisons' list above: the files in read.wd must begin with the species' names, with an underscore between the genus and species. There is currently no version of this function that does not write results directly to file.
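
For orientation, here is a rough conceptual sketch of what the function does, written with data.table and a standard prcomp rather than the bigmemory/bigpca machinery the real function uses for large inputs. The species names, column names, and the assumption that the files are CSVs are all hypothetical, and read.wd is assumed to already be defined.

library(data.table)

species <- c("Melanerpes_carolinus", "Dryobates_pubescens") # hypothetical
columns <- c("bio1", "bio12")                                # hypothetical

# files in read.wd are assumed to begin with Genus_species
files <- list.files(read.wd, full.names = TRUE)
keep  <- files[grepl(paste(species, collapse = "|"), basename(files))]

# combine the relevant files and run a PCA over the chosen columns
combined <- rbindlist(lapply(keep, fread))
pca <- prcomp(combined[, ..columns], center = TRUE, scale. = FALSE)

# scores per observation and variance explained, analogous to what
# bigPCAToFile writes to write.wd and aux.wd
head(pca$x)
summary(pca)$importance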

Value

Nothing to the workspace. For each species in 'comparisons', this function saves a csv with the scores of all observations for that species into write.wd, and a summary table with the proportion of variance explained, etc., into aux.wd.
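
Once the function has run, the outputs can be read back in with base R. The file names below are purely hypothetical placeholders; the actual names depend on the species and on the names of the 'comparisons' list.

# hypothetical file names; adjust to whatever the function actually wrote
scores <- read.csv(file.path(write.wd, "Melanerpes_carolinus.csv"))
aux    <- read.csv(file.path(aux.wd, "woodpeckers.csv"))
head(scores) # PCA scores, one row per observation
aux          # proportion of variance explained per retained axis, etc.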

Examples

#define the comparisons you'll run it over
#temp <- strsplit(list.files(), "_")
#woodpeckers <- paste(lapply(temp, "[", 1), lapply(temp, "[", 2), sep="_")
#woodpeckers <- list(woodpeckers)
#names(woodpeckers) <- "woodpeckers"
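
#a possible continuation of the example above, showing what a call might
#look like; every path, column name, and setting here is a placeholder
#rather than a value taken from the package
#bigPCAToFile(comparisons = woodpeckers,
#  read.wd = "~/ebird/with_env", write.wd = "~/ebird/pca_scores",
#  aux.wd = "~/ebird/pca_summaries",
#  columns = c("bio1", "bio12", "elev"),
#  log.cols = "bio12", scale.center = FALSE, SVD = FALSE, axes = 4)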
