Find_Samples: Convenience Function to (recursively) find all files in a...

View source: R/File_finders.R

Find_Samples (R Documentation)

Convenience Function to (recursively) find all files in a folder.

Description

Files such as raw sequencing FASTQ files, alignment BAM files, or IRFinder output files are often stored in a single folder under some directory structure. They may be grouped by sharing a common directory or a common name, and their sample names can often be inferred from those common names or from the names of the folders that contain them. This function recursively finds all matching files and extracts sample names, assuming either that the files themselves are named by sample (level = 0) or that the sample name is the name of the parent folder (level = 1). Higher levels also work; for example, level = 2 means that the parent folder of the file's parent folder is named by sample. See the Details section below.
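To make the meaning of level concrete, here is a minimal base R sketch of how a sample name could be derived from a file path. The helper sample_name_at_level() and the paths are hypothetical, used for illustration only; this is not the package's actual implementation.

```r
# Hypothetical helper (not part of NxtIRFcore): derive a sample name from a
# file path at a given level, using base R only.
sample_name_at_level <- function(path, suffix, level = 0) {
  if (level == 0) {
    # level = 0: the file name itself, minus the suffix
    name <- basename(path)
    return(substr(name, 1, nchar(name) - nchar(suffix)))
  }
  # level >= 1: walk up `level` directories and take that folder's name
  dir <- path
  for (i in seq_len(level)) dir <- dirname(dir)
  basename(dir)
}

# Illustrative paths; no files need to exist on disk
name0 <- sample_name_at_level("data/sample1.txt.gz", ".txt.gz", level = 0)
name1 <- sample_name_at_level("data/sample1/out.txt.gz", ".txt.gz", level = 1)
name2 <- sample_name_at_level("data/sample1/run1/out.txt.gz", ".txt.gz", level = 2)
```

In all three calls the derived name is "sample1"; only the position of the sample name within the path differs.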

Usage

Find_Samples(sample_path, suffix = ".txt.gz", level = 0)

Find_FASTQ(
  sample_path,
  paired = TRUE,
  fastq_suffix = c(".fastq", ".fq", ".fastq.gz", ".fq.gz"),
  level = 0
)

Find_Bams(sample_path, level = 0)

Find_IRFinder_Output(sample_path, level = 0)

Arguments

sample_path

The path in which to recursively search for files that match the given suffix

suffix

A vector of one or more strings that specify the file suffix (e.g. ".bam" denotes BAM files, whereas ".txt.gz" denotes gzipped text files).

level

Whether sample names are taken from the file names themselves (level = 0), from the parent directory (level = 1), or from the parent of the parent directory (level = 2). The maximum supported level is 3.

paired

Whether to expect single FASTQ files (of the format "sample.fastq"), or paired files (of the format "sample_1.fastq", "sample_2.fastq")

fastq_suffix

The name of the FASTQ suffix. Options are: ".fastq", ".fastq.gz", ".fq", or ".fq.gz"

Details

Paired FASTQ files are assumed to be named using the suffix _1 and _2 after their common names; e.g. sample_1.fastq, sample_2.fastq. Alternate FASTQ suffixes for Find_FASTQ() include ".fq", ".fastq.gz", and ".fq.gz".
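The _1 / _2 naming convention can be illustrated with a short base R sketch that groups mate files by their common name. The file names and the column names file_1 and file_2 are assumptions made for this example; the columns returned by Find_FASTQ() may differ.

```r
# Hypothetical file names; nothing is read from disk
files <- c("sample1_1.fastq", "sample1_2.fastq",
           "sample2_1.fastq", "sample2_2.fastq")

# Strip the "_1.fastq" / "_2.fastq" tail to recover the common sample name
sample <- sub("_[12]\\.fastq$", "", files)
# Record which mate (1 or 2) each file is
mate <- sub("^.*_([12])\\.fastq$", "\\1", files)

# Align both mates to the same sample order, one row per sample
samples <- unique(sample)
pairs <- data.frame(
  sample = samples,
  file_1 = files[mate == "1"][match(samples, sample[mate == "1"])],
  file_2 = files[mate == "2"][match(samples, sample[mate == "2"])],
  stringsAsFactors = FALSE
)
```

Matching on the sample name, rather than relying on file order, keeps the two mates of each sample on the same row even if the input is unsorted.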

For BAM files, the parent directory often denotes the sample name. In this case, use level = 1 so that Find_Bams() automatically annotates the sample names from the parent directory.

IRFinder outputs two files per BAM file processed, both named by the given sample name. The text output is named "sample1.txt.gz" and the COV file is named "sample1.cov", where sample1 is the name of the sample. These files can be organised and tabulated using the function Find_IRFinder_Output. The generic function Find_Samples will organise the IRFinder text output files but exclude the COV files. Use the output of Find_Samples as the Experiment in CollateData if one decides to collate an experiment without linked COV files, for portability reasons.
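As a rough illustration of how the two IRFinder outputs relate, the base R sketch below tabulates text and COV files by sample name. The directory name, the empty placeholder files, and the column names irf_file and cov_file are assumptions for this example and are not taken from the package.

```r
# Hypothetical layout: empty placeholder files in a temporary folder
out_dir <- file.path(tempdir(), "irfinder_sketch")
dir.create(out_dir, showWarnings = FALSE)
invisible(file.create(file.path(out_dir, c("sample1.txt.gz", "sample1.cov",
                                           "sample2.txt.gz", "sample2.cov"))))

# Recursively list files by suffix (dots escaped so they match literally)
txt <- list.files(out_dir, pattern = "\\.txt\\.gz$",
                  recursive = TRUE, full.names = TRUE)
cov <- list.files(out_dir, pattern = "\\.cov$",
                  recursive = TRUE, full.names = TRUE)

# level = 0 style naming: sample name is the file name minus its suffix
txt_df <- data.frame(sample = sub("\\.txt\\.gz$", "", basename(txt)),
                     irf_file = txt, stringsAsFactors = FALSE)
cov_df <- data.frame(sample = sub("\\.cov$", "", basename(cov)),
                     cov_file = cov, stringsAsFactors = FALSE)

# all.x = TRUE keeps samples that lack a COV file (cov_file becomes NA)
irf_table <- merge(txt_df, cov_df, by = "sample", all.x = TRUE)
```

Dropping the cov_file column from such a table gives the text-only layout described above for experiments collated without linked COV files.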

Value

A multi-column data frame with the first column containing the sample name, and subsequent columns being the file paths with suffix as determined by suffix.

Functions

  • Find_Samples: Finds all files with the given suffix pattern. Annotates sample names based on file or parent folder names.

  • Find_FASTQ: Uses Find_Samples() to return all FASTQ files in a given folder.

  • Find_Bams: Uses Find_Samples() to return all BAM files in a given folder.

  • Find_IRFinder_Output: Uses Find_Samples() to return all IRFinder output files in a given folder, including COV files.

Examples

# Retrieve all BAM files in a given folder, named by sample names
bam_path <- tempdir()
example_bams(path = bam_path)
df.bams <- Find_Samples(sample_path = bam_path,
  suffix = ".bam", level = 0)
# equivalent to:
df.bams <- Find_Bams(bam_path, level = 0)

# Retrieve all IRFinder output files in a given folder,
# named by sample names

expr <- Find_IRFinder_Output(file.path(tempdir(), "IRFinder_output"))
## Not run: 

# Find FASTQ files in a directory, named by sample names
# where files are in the form:
# - "./sample_folder/sample1.fastq"
# - "./sample_folder/sample2.fastq"

Find_FASTQ("./sample_folder", paired = FALSE, fastq_suffix = ".fastq")

# Find paired gzipped FASTQ files in a directory, named by parent directory
# where files are in the form:
# - "./sample_folder/sample1/raw_1.fq.gz"
# - "./sample_folder/sample1/raw_2.fq.gz"
# - "./sample_folder/sample2/raw_1.fq.gz"
# - "./sample_folder/sample2/raw_2.fq.gz"

# level = 1 takes the sample name from the parent directory
Find_FASTQ("./sample_folder", paired = TRUE, fastq_suffix = ".fq.gz",
  level = 1)

## End(Not run)


alexchwong/NxtIRFcore documentation built on Oct. 31, 2022, 9:14 a.m.