fret_stats: Calculate FRET test statistics


Usage

fret_stats(pheno_file_list, trait_file, mode = c("dry_run", "s0_only",
  "full"), s0, zmin, z0 = if (!missing(zmin)) 0.3 * zmin,
  s0_est_size = pheno_file_list[1], pheno_transformation = NULL,
  trait = "x", covariates = c(), sample = "name", stat_type = c("huber",
  "lm", "qp", "custom"), stat_fun = NULL, resid_fun = NULL, libs = c(),
  seed, n_perm = 0, bandwidth = 151, smoother = c("ksmooth_0", "ksmooth",
  "none"), chunksize = 1e+05, which_chunks = "all", temp_dir = "./",
  temp_prefix = NULL, labels = pheno_file_list, cores = 1)

Arguments

pheno_file_list

Name or vector of names of genomic phenotype files

trait_file

Name of trait file

mode

One of "dry_run", "s0_only", or "full". See Details.

s0

Variance inflation constant. If missing, s0 will be estimated using a random chunk of the data (see Details and the s0_est_size argument).

zmin

Minimum threshold. If missing, it will be set to the 90th percentile of a sample of test statistics (see Details).

z0

Merging threshold. If missing, z0 is set to 0.3*zmin.

s0_est_size

Number of test statistics used to estimate s0. Alternatively, s0_est_size may be a character string giving a phenotype file name or may be "all" to use all files.

pheno_transformation

If the phenotype is to be transformed, provide a function taking one argument (the vector of phenotype) and outputting the transformed phenotype.

trait

Name of trait (should match header of trait_file)

covariates

List of covariates to adjust for (names should match header of trait_file)

stat_type

Type of test statistic. One of "huber", "lm", "qp", or "custom" (see Usage).

seed

Seed (used for permutations). If missing, a random seed will be chosen and stored.

n_perm

Number of permutations. If n_perm = 0, only test statistics for unpermuted data will be calculated.

bandwidth

Smoothing bandwidth

chunksize

Size of chunks to read at one time

which_chunks

Either a vector of chunk numbers or "all". See "Running on a cluster or in parallel" in Details.

temp_dir

Directory to write temporary chunk output to

temp_prefix

Prefix to use for chunk output

labels

Vector of labels for each phenotype file. These will be used in results tables and also in temporary file names.

cores

Number of cores to use. Using more than one requires the parallel package.

huber_maxit

Maximum iterations for Huber estimator.

smoother

Choice of smoother for smoothing test statistics. One of "ksmooth_0", "ksmooth", or "none". See Details.
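As an illustration of the pheno_transformation argument described above, the following is a minimal sketch of a function with the documented shape: one argument (the phenotype vector) in, the transformed vector out. The name log1p_transform and the choice of a log(1 + x) transform are assumptions made for illustration, not something the package prescribes.

```r
# Hypothetical transformation: log(1 + x), which keeps zero counts at zero.
log1p_transform <- function(pheno) {
  log1p(pheno)
}

# Quick check on a toy phenotype vector.
toy_pheno <- c(0, 1, 9)
transformed <- log1p_transform(toy_pheno)
```

Such a function would be supplied as pheno_transformation = log1p_transform in the call to fret_stats.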

Details

mode: The function can run in one of three modes. If mode = "dry_run", it will report information about the data provided, the model, and the number of chunks, and then exit. If mode = "s0_only", it will report this information, estimate s0 and zmin, and exit. If mode = "full", it will proceed to calculate all test statistics and permutation test statistics for the specified chunks.
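A cautious way to use the modes is to start with a dry run before committing to a full analysis. The sketch below only uses argument names from the Usage section; the file names are hypothetical placeholders, and the call is skipped unless the fret package is installed and the files exist. What fret_stats returns in each mode is not documented here, so the result is stored without further processing.

```r
# Hypothetical input files; replace with real paths.
dry_run_args <- list(
  pheno_file_list = c("pheno_chr1.txt", "pheno_chr2.txt"),
  trait_file = "traits.txt",
  trait = "x",
  mode = "dry_run"
)

# Only attempt the call when the package and the inputs are available.
inputs_exist <- all(file.exists(c(dry_run_args$pheno_file_list,
                                  dry_run_args$trait_file)))
if (requireNamespace("fret", quietly = TRUE) && inputs_exist) {
  dry_run_result <- do.call(fret::fret_stats, dry_run_args)
}
```

Once the dry run reports the expected data, model, and number of chunks, the same argument list can be reused with mode set to "s0_only" or "full".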

Running on a cluster or in parallel: The which_chunks argument is intended to facilitate breaking a very large job into many small jobs that can be easily submitted to a cluster. It can also help with resuming an analysis that was interrupted. To limit memory requirements, only chunks of size chunksize will be read in and analyzed at one time. Results of these analyses are then written to disk in files named temp_dir/temp_prefix-label.chunknum.RDS. It is important to make sure that temp_dir has enough space to store a large number of test statistics. If which_chunks = "all", these temporary files will be automatically aggregated into a single set of results. If the analysis is conducted over many jobs, the user will need to call the collect_fret_stats function to aggregate the results (see the documentation for collect_fret_stats). In addition to breaking chunks over many nodes or many jobs, the cores parameter can be used to perform calculations using multiple cores via the parallel package.
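The chunk bookkeeping described above can be sketched in base R. This example assumes that the number of chunks is the number of positions divided by chunksize, rounded up; the exact count for real data is what mode = "dry_run" reports. The values of n_positions and n_jobs, and the helper chunk_file, are hypothetical and used only for illustration. The file name pattern follows temp_dir/temp_prefix-label.chunknum.RDS as stated above.

```r
# Hypothetical problem size.
n_positions <- 1250000
chunksize <- 1e5

# Assumed chunk count: positions divided by chunk size, rounded up.
n_chunks <- ceiling(n_positions / chunksize)

# Deal chunk numbers out to jobs in round-robin order; each element of
# job_chunks is a candidate value for which_chunks in one job.
n_jobs <- 4
job_chunks <- split(seq_len(n_chunks), rep_len(seq_len(n_jobs), n_chunks))

# Build the temporary file name documented for one chunk.
chunk_file <- function(temp_dir, temp_prefix, label, chunknum) {
  file.path(temp_dir, paste0(temp_prefix, "-", label, ".", chunknum, ".RDS"))
}
example_file <- chunk_file("tmp", "run1", "chr1", 5)
```

Each job would then call fret_stats with its own which_chunks vector, and collect_fret_stats would be run once all jobs have finished.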

Smoother choice: There are three options for smoothing test statistics. "ksmooth_0" is a box kernel smoother for observations made at integer positions. It assumes that observations at missing positions are equal to 0. This is an appropriate smoother choice for DNase-seq and similar data types: in DNase-seq data, if a position is not present in the data, all samples have 0 cleavages observed at that position, so the test statistic is equal to 0. "ksmooth" is a box kernel smoother that assumes observations at missing positions are missing. This is appropriate for bisulfite sequencing data. The third option, "none", is not described further here; by its name it presumably leaves the test statistics unsmoothed.
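The difference between the two box kernel smoothers can be illustrated with stats::ksmooth from base R. This is a sketch of the idea only: it is an assumption that the package's smoothers behave like stats::ksmooth with kernel = "box", and the toy positions, statistics, and bandwidth are made up. Positions 3 and 4 are absent from the data; one variant ignores them and the other fills them with 0 before smoothing.

```r
# Observed positions and test statistics (positions 3 and 4 are absent).
pos <- c(1, 2, 5)
stat <- c(3, 6, 9)
bandwidth <- 3  # box kernel covers positions within bandwidth / 2 of the target

# "ksmooth"-like behaviour: absent positions are treated as missing.
sm_missing <- stats::ksmooth(pos, stat, kernel = "box",
                             bandwidth = bandwidth, x.points = pos)

# "ksmooth_0"-like behaviour: absent positions are filled with 0 first.
all_pos <- as.numeric(seq(min(pos), max(pos)))
filled <- numeric(length(all_pos))
filled[match(pos, all_pos)] <- stat
sm_zero <- stats::ksmooth(all_pos, filled, kernel = "box",
                          bandwidth = bandwidth, x.points = pos)

# Smoothed values at the observed positions under each assumption.
comparison <- data.frame(pos = pos, missing = sm_missing$y, zero = sm_zero$y)
```

Filling with zeros pulls the smoothed statistic toward 0 wherever a window contains absent positions, which matches the DNase-seq reasoning above; treating them as missing averages only over observed positions.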

Estimating s0, zmin, and z0: If s0 is not provided, it will be estimated from the data (if mode = "s0_only" or "full"). The function will use an amount of data specified by the s0_est_size argument. If this argument is a file name, all the data in that file will be used. If it is an integer, the number of data points specified will be used. If fewer than 1,000,000 data points are used, the estimate might be unstable and a warning will be given. If zmin is missing, it will be set to the 90th percentile of the test statistics in the data sample (after correcting using s0). If z0 is missing, it will be set to 0.3*zmin.
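The default thresholds can be reproduced by hand from a sample of test statistics. The sketch below assumes the statistics have already been corrected using s0 (the correction itself is not specified here) and uses a made-up sample. Whether the package takes the percentile of signed or absolute values is not stated, so the values are used as given.

```r
# Hypothetical sample of s0-corrected test statistics.
stats_sample <- (0:100) / 10

# zmin defaults to the 90th percentile of the sample.
zmin <- unname(quantile(stats_sample, probs = 0.9))

# z0 defaults to 0.3 * zmin.
z0 <- 0.3 * zmin

# The documentation warns when fewer than 1,000,000 points are used.
small_sample <- length(stats_sample) < 1e6
```

In a real analysis the sample would be far larger than this toy vector, and supplying zmin and z0 explicitly to fret_stats overrides these defaults.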


jean997/fret documentation built on May 18, 2019, 11:43 p.m.