createArrowFiles: Create Arrow Files
In haibol2016/ArchR_debug: Analyzing single-cell regulatory chromatin in R.

createArrowFiles

R Documentation

Create Arrow Files

Description

This function will create ArrowFiles from input files. These ArrowFiles are the main constituent for downstream analysis in ArchR.

Usage

createArrowFiles(
  inputFiles = NULL,
  sampleNames = names(inputFiles),
  outputNames = sampleNames,
  validBarcodes = NULL,
  geneAnnotation = getGeneAnnotation(),
  genomeAnnotation = getGenomeAnnotation(),
  minTSS = 4,
  minFrags = 1000,
  maxFrags = 1e+05,
  QCDir = "QualityControl",
  nucLength = 147,
  promoterRegion = c(2000, 100),
  TSSParams = list(),
  excludeChr = c("chrM", "chrY"),
  nChunk = 5,
  bcTag = "qname",
  gsubExpression = NULL,
  bamFlag = NULL,
  offsetPlus = 4,
  offsetMinus = -5,
  addTileMat = TRUE,
  TileMatParams = list(),
  addGeneScoreMat = TRUE,
  GeneScoreMatParams = list(),
  force = FALSE,
  threads = getArchRThreads(),
  parallelParam = NULL,
  subThreading = TRUE,
  verbose = TRUE,
  cleanTmp = TRUE,
  logFile = createLogFile("createArrows"),
  filterFrags = NULL,
  filterTSS = NULL
)
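
A typical call might look like the following. This is a minimal sketch, not taken from the package's own examples: the fragment file paths and sample names are hypothetical, and the call only runs when ArchR is installed and those files exist. It assumes a default genome has been set with `addArchRGenome()` so that `getGeneAnnotation()` and `getGenomeAnnotation()` can supply the annotation defaults.

```r
# Hypothetical fragment files; replace with real paths.
# Naming the vector lets sampleNames default to names(inputFiles).
inputFiles <- c(
  PBMC_1 = "data/PBMC_1.fragments.tsv.gz",
  PBMC_2 = "data/PBMC_2.fragments.tsv.gz"
)
sampleNames <- names(inputFiles)

# Guard: only call ArchR when it is installed and the input files exist.
canRun <- requireNamespace("ArchR", quietly = TRUE) &&
  all(file.exists(inputFiles))

if (canRun) {
  library(ArchR)
  addArchRGenome("hg38")  # sets the default gene and genome annotations

  ArrowFiles <- createArrowFiles(
    inputFiles = inputFiles,
    sampleNames = sampleNames,
    minTSS = 4,        # per-cell TSS enrichment threshold
    minFrags = 1000,   # per-cell minimum fragment count
    addTileMat = TRUE,
    addGeneScoreMat = TRUE
  )
}
```

Passing a named character vector keeps sample names and input paths in the same order, which the `sampleNames` and `outputNames` arguments require.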

Arguments

`inputFiles`	A character vector containing the paths to the input files to use to generate the ArrowFiles. These files can be in one of the following formats: (i) scATAC tabix files, (ii) fragment files, or (iii) bam files.
`sampleNames`	A character vector containing the names to assign to the samples that correspond to the `inputFiles`. Each input file should receive a unique sample name. This list should be in the same order as `inputFiles`.
`outputNames`	The prefix to use for output files. Each input file should receive a unique output file name. This list should be in the same order as `inputFiles`. For example, if the prefix is "PBMC", the output file will be named "PBMC.arrow".
`validBarcodes`	A list of valid barcode strings to be used for filtering cells read from each input file (see `getValidBarcodes()` for 10x fragment files).
`geneAnnotation`	The geneAnnotation (see `createGeneAnnotation()`) to associate with the ArrowFiles. This is used downstream to calculate TSS Enrichment Scores etc.
`genomeAnnotation`	The genomeAnnotation (see `createGenomeAnnotation()`) to associate with the ArrowFiles. This is used downstream to collect chromosome sizes and nucleotide information etc.
`minTSS`	The minimum numeric transcription start site (TSS) enrichment score required for a cell to pass filtering for use in downstream analyses. Cells with a TSS enrichment score greater than or equal to `minTSS` will be retained. TSS enrichment score is a measurement of signal-to-background in ATAC-seq.
`minFrags`	The minimum number of mapped ATAC-seq fragments required per cell to pass filtering for use in downstream analyses. Cells containing greater than or equal to `minFrags` total fragments will be retained.
`maxFrags`	The maximum number of mapped ATAC-seq fragments allowed per cell to pass filtering for use in downstream analyses. Cells containing less than or equal to `maxFrags` total fragments will be retained.
`QCDir`	The relative path to the output directory for QC-level information and plots for each sample/ArrowFile.
`nucLength`	The length in basepairs of the DNA that wraps around a nucleosome. This number is used for identifying fragments as sub-nucleosome-spanning, mono-nucleosome-spanning, or multi-nucleosome-spanning.
`promoterRegion`	An integer vector describing the number of basepairs upstream and downstream, `c(upstream, downstream)`, of the TSS to include as the promoter region for downstream calculations such as the fraction of reads in promoters (FIP).
`TSSParams`	A list of parameters for computing TSS Enrichment scores. This includes the `window` which is the size in basepairs of the window centered at each TSS (default 101), the `flank` which is the size in basepairs of the flanking window (default 2000), and the `norm` which describes the size in basepairs of the flank window to be used for normalization of the TSS enrichment score (default 100). For example, given `window = 101, flank = 2000, norm = 100`, the accessibility within the 101 bp surrounding the TSS will be normalized to the accessibility in the two 100-bp bins from -2000 bp to -1901 bp and from +1901 bp to +2000 bp.
`excludeChr`	A character vector containing the names of chromosomes to be excluded from downstream analyses. In most human/mouse analyses, this includes the mitochondrial DNA (chrM) and the male sex chromosome (chrY). This does not, however, exclude the corresponding fragments from being stored in the ArrowFile.
`nChunk`	The number of chunks to divide each chromosome into to allow for low-memory parallelized reading of the `inputFiles`. Higher numbers reduce memory usage but increase compute time.
`bcTag`	The name of the field in the input bam file containing the barcode tag information. See `scanBam` in Rsamtools.
`gsubExpression`	A regular expression used to clean up the barcode tag string read in from a bam file. For example, if the barcode is appended to the readname or qname field like for the mouse atlas data from Cusanovich* and Hill* et al. (2018), the gsubExpression would be ":.*". This would retrieve the string after the colon as the barcode.
`bamFlag`	A vector of bam flags to be used for reading in fragments from input bam files. Should be in the format of a `scanBamFlag` passed to `scanBam` in Rsamtools.
`offsetPlus`	The numeric offset to apply to a "+" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.
`offsetMinus`	The numeric offset to apply to a "-" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.
`addTileMat`	A boolean value indicating whether to add a "Tile Matrix" to each ArrowFile. A Tile Matrix is a counts matrix that, instead of using peaks, uses a fixed-width sliding window of bins across the whole genome. This matrix can be used in many downstream ArchR operations.
`TileMatParams`	A list of parameters to pass to the `addTileMatrix()` function. See `addTileMatrix()` for options.
`addGeneScoreMat`	A boolean value indicating whether to add a Gene-Score Matrix to each ArrowFile. A Gene-Score Matrix uses ATAC-seq signal proximal to the TSS to estimate gene activity.
`GeneScoreMatParams`	A list of parameters to pass to the `addGeneScoreMatrix()` function. See `addGeneScoreMatrix()` for options.
`force`	A boolean value indicating whether to force ArrowFiles to be overwritten if they already exist.
`threads`	The number of threads to be used for parallel computing.
`parallelParam`	A list of parameters to be passed for BiocParallel/batchtools parallel computing.
`subThreading`	A boolean value determining whether, when possible, to use additional threads within each multi-threaded subprocess if `threads` is greater than the number of input samples.
`verbose`	A boolean value that determines whether standard output should be printed.
`logFile`	The path to a file to be used for logging ArchR output.
`cleanTmp`	A boolean value that determines whether to clean temp folder of all intermediate ".arrow" files.
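
The `minTSS`, `minFrags`, and `maxFrags` thresholds combine into a per-cell pass/fail filter. The sketch below illustrates that logic in base R on a hypothetical QC table with invented values. It is not ArchR's internal code, and treating the upper bound as inclusive is an assumption.

```r
# Hypothetical per-cell QC metrics (invented values for illustration).
qc <- data.frame(
  barcode = c("AAAC", "AAAG", "AACA", "AACC"),
  nFrags = c(500, 1000, 25000, 150000),
  TSSEnrichment = c(8.0, 4.0, 3.5, 10.0)
)

# Thresholds matching the defaults listed under Usage.
minTSS <- 4
minFrags <- 1000
maxFrags <- 1e5

# A cell passes only if it satisfies all three conditions.
# Assumption: both fragment bounds are inclusive.
keep <- qc$TSSEnrichment >= minTSS &
  qc$nFrags >= minFrags &
  qc$nFrags <= maxFrags

passing <- qc$barcode[keep]
print(passing)
```

Here only the second cell passes: the first has too few fragments, the third falls below `minTSS`, and the fourth exceeds `maxFrags`.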
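
The `TSSParams` example can be checked with a little arithmetic. The following base-R sketch derives the central window and the two normalization bins, relative to a TSS at position 0, from `window`, `flank`, and `norm`. The variable names are illustrative, and the sketch covers only the coordinates, not how ArchR computes the enrichment score itself.

```r
# Default TSSParams values from the argument description.
window <- 101
flank <- 2000
norm <- 100

# Central window: 'window' bp centered on the TSS (position 0).
half <- (window - 1) / 2
central <- c(-half, half)

# Normalization bins: the outermost 'norm' bp of each flank.
leftFlank <- c(-flank, -flank + norm - 1)
rightFlank <- c(flank - norm + 1, flank)

# Inclusive width of an interval given as c(start, end).
widthOf <- function(x) x[2] - x[1] + 1

print(central)     # bounds of the central window
print(leftFlank)   # bounds of the upstream normalization bin
print(rightFlank)  # bounds of the downstream normalization bin
```

With the defaults, the central window spans -50 to +50 (101 bp) and each normalization bin is 100 bp wide, matching the intervals given for `TSSParams`.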
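
The effect of `offsetPlus` and `offsetMinus` can be illustrated on a single fragment. This is a simplified sketch with invented coordinates: it assumes the fragment start corresponds to the "+" stranded insertion and the fragment end to the "-" stranded insertion, and it ignores coordinate-system details (0-based versus 1-based) that depend on the input format.

```r
# Default offsets from the Usage section.
offsetPlus <- 4
offsetMinus <- -5

# Hypothetical fragment coordinates (invented for illustration).
fragStart <- 1000  # "+" stranded end of the fragment
fragEnd <- 1200    # "-" stranded end of the fragment

# Shift each end toward the fragment interior.
insertionPlus <- fragStart + offsetPlus
insertionMinus <- fragEnd + offsetMinus

print(c(insertionPlus, insertionMinus))
```

Both shifts move the insertion sites inward, so the adjusted fragment is 9 bp shorter than the original.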