createArrowFiles | R Documentation |
This function will create ArrowFiles from input files. These ArrowFiles are the main constituent for downstream analysis in ArchR.
createArrowFiles(
  inputFiles = NULL,
  sampleNames = names(inputFiles),
  outputNames = sampleNames,
  validBarcodes = NULL,
  geneAnnotation = getGeneAnnotation(),
  genomeAnnotation = getGenomeAnnotation(),
  minTSS = 4,
  minFrags = 1000,
  maxFrags = 1e+05,
  QCDir = "QualityControl",
  nucLength = 147,
  promoterRegion = c(2000, 100),
  TSSParams = list(),
  excludeChr = c("chrM", "chrY"),
  nChunk = 5,
  bcTag = "qname",
  gsubExpression = NULL,
  bamFlag = NULL,
  offsetPlus = 4,
  offsetMinus = -5,
  addTileMat = TRUE,
  TileMatParams = list(),
  addGeneScoreMat = TRUE,
  GeneScoreMatParams = list(),
  force = FALSE,
  threads = getArchRThreads(),
  parallelParam = NULL,
  subThreading = TRUE,
  verbose = TRUE,
  cleanTmp = TRUE,
  logFile = createLogFile("createArrows"),
  filterFrags = NULL,
  filterTSS = NULL
)
inputFiles |
A character vector containing the paths to the input files to use to generate the ArrowFiles. These files can be in one of the following formats: (i) scATAC tabix files, (ii) fragment files, or (iii) bam files. |
sampleNames |
A character vector containing the names to assign to the samples that correspond to the |
outputNames |
The prefix to use for output files. Each input file should receive a unique output file name. This list should be in the same order as "inputFiles". For example, if the prefix is "PBMC", the output file will be named "PBMC.arrow". |
validBarcodes |
A list of valid barcode strings to be used for filtering cells read from each input file
(see |
geneAnnotation |
The geneAnnotation (see |
genomeAnnotation |
The genomeAnnotation (see |
minTSS |
The minimum numeric transcription start site (TSS) enrichment score required for a cell to pass filtering for use
in downstream analyses. Cells with a TSS enrichment score greater than or equal to |
minFrags |
The minimum number of mapped ATAC-seq fragments required per cell to pass filtering for use in downstream analyses.
Cells containing greater than or equal to |
maxFrags |
The maximum number of mapped ATAC-seq fragments allowed per cell to pass filtering for use in downstream analyses.
Cells containing less than or equal to |
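The three per-cell thresholds above can be pictured as one combined filter. The following is only a sketch in base R, not ArchR's internal code: the `cells` data frame and its column names are hypothetical, and it assumes that `maxFrags` acts as an upper bound while `minTSS` and `minFrags` act as lower bounds.

```r
# Hypothetical per-cell QC table; column names are illustrative only.
cells <- data.frame(
  barcode       = c("A", "B", "C", "D"),
  nFrags        = c(500, 5000, 250000, 12000),
  TSSEnrichment = c(8.1, 6.3, 9.0, 2.5)
)

# Default thresholds taken from the usage signature.
minTSS   <- 4
minFrags <- 1000
maxFrags <- 1e+05

# Assumed semantics: keep cells at or above both minimums and at or below the maximum.
keep <- cells$TSSEnrichment >= minTSS &
  cells$nFrags >= minFrags &
  cells$nFrags <= maxFrags

passing <- cells$barcode[keep]
print(passing)
```

In this toy table only barcode "B" satisfies all three conditions: "A" has too few fragments, "C" has too many, and "D" falls below the TSS enrichment cutoff.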
QCDir |
The relative path to the output directory for QC-level information and plots for each sample/ArrowFile. |
nucLength |
The length in basepairs that wraps around a nucleosome. This number is used for identifying fragments as sub-nucleosome-spanning, mono-nucleosome-spanning, or multi-nucleosome-spanning. |
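As an illustration of how a single `nucLength` value can split fragments into three classes, here is a base R sketch. The cut points (below `nucLength`, from `nucLength` up to twice `nucLength`, and twice `nucLength` or more) are an assumption made for this example, and the fragment widths are made up.

```r
nucLength <- 147

# Hypothetical fragment widths in basepairs.
widths <- c(80, 147, 200, 294, 400)

# Assumed class boundaries: [0, 147), [147, 294), [294, Inf).
fragClass <- cut(
  widths,
  breaks = c(0, nucLength, 2 * nucLength, Inf),
  labels = c("sub-nucleosome", "mono-nucleosome", "multi-nucleosome"),
  right  = FALSE
)

print(data.frame(width = widths, class = fragClass))
```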
promoterRegion |
An integer vector describing the number of basepairs upstream and downstream, as c(upstream, downstream), of the TSS to include as the promoter region for downstream calculation of things like the fraction of reads in promoters (FIP). |
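To make the c(upstream, downstream) convention concrete, the sketch below computes a promoter window around a TSS in base R. The helper `promoter_window` is hypothetical and not part of ArchR; it assumes that "upstream" is interpreted relative to the strand of the gene.

```r
# Hypothetical helper: returns the promoter window for one TSS.
# For "-" strand genes, upstream lies at higher genomic coordinates.
promoter_window <- function(tss, strand, promoterRegion = c(2000, 100)) {
  upstream   <- promoterRegion[1]
  downstream <- promoterRegion[2]
  if (strand == "-") {
    c(start = tss - downstream, end = tss + upstream)
  } else {
    c(start = tss - upstream, end = tss + downstream)
  }
}

plusWindow  <- promoter_window(50000, "+")
minusWindow <- promoter_window(50000, "-")
print(plusWindow)
print(minusWindow)
```

With the default c(2000, 100), a "+" strand TSS at position 50000 gives the window 48000 to 50100, and a "-" strand TSS at the same position gives 49900 to 52000.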
TSSParams |
A list of parameters for computing TSS Enrichment scores. This includes the |
excludeChr |
A character vector containing the names of chromosomes to be excluded from downstream analyses. In most human/mouse analyses, this includes the mitochondrial DNA (chrM) and the male sex chromosome (chrY). This does not, however, exclude the corresponding fragments from being stored in the ArrowFile. |
nChunk |
The number of chunks to divide each chromosome into to allow for low-memory parallelized reading of the |
bcTag |
The name of the field in the input bam file containing the barcode tag information. See |
gsubExpression |
A regular expression used to clean up the barcode tag string read in from a bam file. For example, if the barcode is appended to the readname or qname field, as in the mouse atlas data from Cusanovich* and Hill* et al. (2018), the gsubExpression would be ":.*". This would remove the colon and everything after it, leaving the string before the colon as the barcode. |
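The effect of such an expression can be checked in base R before running a full import. This sketch assumes the expression is applied in the manner of `gsub(gsubExpression, "", tag)`, that is, every match is replaced with an empty string; the read names below are made up.

```r
# Hypothetical read names of the form "<barcode>:<read index>".
qnames <- c("AACGTGAT:1021", "AACGTGAT:1022", "TTGCAAGC:7")

gsubExpression <- ":.*"

# Assumed behaviour: the matched portion is removed from each tag.
barcodes <- gsub(gsubExpression, "", qnames)
print(barcodes)
```

Under this assumption the pattern ":.*" deletes the colon and everything following it, so the three reads collapse to the two barcodes "AACGTGAT" and "TTGCAAGC".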
bamFlag |
A vector of bam flags to be used for reading in fragments from input bam files. Should be in the format of a
|
offsetPlus |
The numeric offset to apply to a "+" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013. |
offsetMinus |
The numeric offset to apply to a "-" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013. |
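A small worked example of the two offsets, in base R. It assumes that `offsetPlus` is added to the start coordinate of an aligned fragment and `offsetMinus` is added to its end coordinate; the fragment coordinates are hypothetical.

```r
offsetPlus  <- 4
offsetMinus <- -5

# Hypothetical aligned fragment (start and end coordinates).
fragStart <- 10000
fragEnd   <- 10150

# Assumed adjustment: shift each end toward the estimated Tn5 insertion site.
insertionPlus  <- fragStart + offsetPlus
insertionMinus <- fragEnd + offsetMinus

print(c(plus = insertionPlus, minus = insertionMinus))
```

With the defaults, the "+" end moves from 10000 to 10004 and the "-" end moves from 10150 to 10145.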
addTileMat |
A boolean value indicating whether to add a "Tile Matrix" to each ArrowFile. A Tile Matrix is a counts matrix that, instead of using peaks, uses a fixed-width sliding window of bins across the whole genome. This matrix can be used in many downstream ArchR operations. |
TileMatParams |
A list of parameters to pass to the |
addGeneScoreMat |
A boolean value indicating whether to add a Gene-Score Matrix to each ArrowFile. A Gene-Score Matrix uses ATAC-seq signal proximal to the TSS to estimate gene activity. |
GeneScoreMatParams |
A list of parameters to pass to the |
force |
A boolean value indicating whether to force ArrowFiles to be overwritten if they already exist. |
threads |
The number of threads to be used for parallel computing. |
parallelParam |
A list of parameters to be passed for BiocParallel/batchtools parallel computing. |
subThreading |
A boolean value that determines whether, when possible, to use additional threads within each multi-threaded subprocess if the number of threads is greater than the number of input samples. |
verbose |
A boolean value that determines whether standard output should be printed. |
logFile |
The path to a file to be used for logging ArchR output. |
cleanTmp |
A boolean value that determines whether to clean the temp folder of all intermediate ".arrow" files. |
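A minimal call might look like the sketch below. The file paths and sample names are hypothetical, the choice of "hg38" is an assumption, and the call is guarded so that it only runs when ArchR is installed and the input files exist. Only arguments listed in the usage above are passed; everything else is left at its default.

```r
# Hypothetical fragment files, named by sample.
inputFiles <- c(
  PBMC_1 = "data/PBMC_1.fragments.tsv.gz",
  PBMC_2 = "data/PBMC_2.fragments.tsv.gz"
)

# Run only if ArchR is available and the (hypothetical) files are present.
if (requireNamespace("ArchR", quietly = TRUE) && all(file.exists(inputFiles))) {
  library(ArchR)
  addArchRGenome("hg38")  # assumption: human data aligned to hg38

  arrowFiles <- createArrowFiles(
    inputFiles  = inputFiles,
    sampleNames = names(inputFiles),
    minTSS      = 4,
    minFrags    = 1000,
    addTileMat  = TRUE,
    addGeneScoreMat = TRUE
  )
  print(arrowFiles)
}
```

Because `outputNames` defaults to `sampleNames`, this call would be expected to write "PBMC_1.arrow" and "PBMC_2.arrow".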