create_data: Create a 'BANDITS_data' object

View source: R/create_data.R

create_data    R Documentation

Create a 'BANDITS_data' object

Description

create_data imports the equivalence classes and creates a 'BANDITS_data' object.

Usage

create_data(
  salmon_or_kallisto,
  gene_to_transcript,
  salmon_path_to_eq_classes = NULL,
  kallisto_equiv_classes = NULL,
  kallisto_equiv_counts = NULL,
  kallisto_counts = NULL,
  eff_len,
  n_cores = NULL,
  transcripts_to_keep = NULL,
  max_genes_per_group = 50
)

Arguments

salmon_or_kallisto

a character string indicating the input data: 'salmon' or 'kallisto'.

gene_to_transcript

a matrix or data.frame with a list of gene-to-transcript correspondences. The first column contains the gene id, and the second column contains the transcript id.
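
As a minimal sketch of the expected layout, a two-column data.frame can be built by hand. The ids below are hypothetical and serve only to illustrate the column order (gene id first, transcript id second); the uniqueness check is an optional sanity check assumed here, not a documented requirement of create_data.

```r
# Hypothetical ids, for illustration only.
gene_to_transcript_toy <- data.frame(
  gene_id       = c("geneA", "geneA", "geneB"),
  transcript_id = c("txA1", "txA2", "txB1"),
  stringsAsFactors = FALSE
)

# Optional sanity check (an assumption of this sketch):
# no transcript id is listed more than once.
stopifnot(!anyDuplicated(gene_to_transcript_toy$transcript_id))
```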

salmon_path_to_eq_classes

(for salmon input only) a vector of length equal to the number of samples: each element indicates the path to the equivalence classes of the respective sample (computed by salmon).

kallisto_equiv_classes

(for kallisto input only) a vector of length equal to the number of samples: each element indicates the path to the equivalence classes ('.ec' files) of the respective sample (computed by kallisto).

kallisto_equiv_counts

(for kallisto input only) a vector of length equal to the number of samples: each element indicates the path to the counts of the equivalence classes ('.tsv' files) of the respective sample (computed by kallisto).

kallisto_counts

(for kallisto input only) a matrix or data.frame, with 1 column per sample and 1 row per transcript, containing the estimated abundances for each transcript in each sample, computed by kallisto. The matrix must be unfiltered and the order of rows must be unchanged.

eff_len

a vector containing the effective length of transcripts; the vector names indicate the transcript ids. Ideally, created via eff_len_compute.
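
For illustration, the expected shape of eff_len (a numeric vector named by transcript id) can be reproduced on toy data. The sketch below takes the per-transcript median across samples, mirroring the description of eff_len_compute in the Examples section; it is not the package's implementation, and the matrix values and ids are made up.

```r
# Toy matrix of effective lengths: one row per transcript, one column per sample.
# Values and ids are hypothetical.
len_toy <- matrix(c(100, 110, 120,
                    500, 480, 520),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("txA1", "txB1"),
                                  paste0("sample", 1:3)))

# Median effective length per transcript; apply() keeps the row names,
# so the result is a vector named by transcript id.
eff_len_toy <- apply(len_toy, 1, median)
```

In practice, eff_len_compute should be preferred, as recommended above.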

n_cores

the number of cores to parallelize the tasks on. Using at least one core per sample is strongly recommended; one core per sample is the default when 'n_cores' is not specified by the user.

transcripts_to_keep

a vector containing the list of transcripts to keep. Ideally, created via filter_transcripts.

max_genes_per_group

an integer specifying the maximum number of genes that each group can contain. When equivalence classes contain transcripts from distinct genes, these genes are analyzed together; for computational reasons, 'max_genes_per_group' caps the number of genes in each such group.
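
To make the notion of a gene group concrete, the sketch below links genes that share at least one equivalence class and collects them into groups (connected components). This is an illustration on made-up data, not the package's internal algorithm; all object names are hypothetical, and how BANDITS treats groups larger than 'max_genes_per_group' is not covered here.

```r
# Toy equivalence classes: each element lists the transcripts in one class.
eq_classes_toy <- list(c("txA1", "txA2"),
                       c("txA2", "txB1"),   # links geneA and geneB
                       c("txC1"),
                       c("txD1", "txD2"))
tx2gene_toy <- c(txA1 = "geneA", txA2 = "geneA", txB1 = "geneB",
                 txC1 = "geneC", txD1 = "geneD", txD2 = "geneD")

# Start with every gene in its own group, labelled 1, 2, 3, ...
genes_toy <- unique(unname(tx2gene_toy))
group <- setNames(seq_along(genes_toy), genes_toy)

# For each class, merge the groups of all genes it touches
# by relabelling them with the smallest label involved.
for (ec in eq_classes_toy) {
  ids <- unique(group[unique(tx2gene_toy[ec])])
  group[group %in% ids] <- min(ids)
}

gene_groups <- split(names(group), group)

# Flag groups exceeding a chosen cap (50 is the documented default).
max_genes_per_group_toy <- 50
too_large <- lengths(gene_groups) > max_genes_per_group_toy
```

Here geneA and geneB end up in the same group because the second class contains transcripts from both, while geneC and geneD each form a group of their own.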

Value

A BANDITS_data object.

Author(s)

Simone Tiberi simone.tiberi@uzh.ch

See Also

eff_len_compute, filter_transcripts, filter_genes, BANDITS_data

Examples

# specify the directory of the internal data:
data_dir = system.file("extdata", package = "BANDITS")

# load gene_to_transcript matching:
data("gene_tr_id", package = "BANDITS")

# specify the paths to the transcript-level estimated counts:
sample_names = paste0("sample", seq_len(4))
quant_files = file.path(data_dir, "STAR-salmon", sample_names, "quant.sf")

# load the transcript-level estimated counts via tximport:
library(tximport)
txi = tximport(files = quant_files, type = "salmon", txOut = TRUE)
counts = txi$counts

# Optional (recommended): transcript pre-filtering
transcripts_to_keep = filter_transcripts(gene_to_transcript = gene_tr_id,
                                         transcript_counts = counts,
                                         min_transcript_proportion = 0.01,
                                         min_transcript_counts = 10,
                                         min_gene_counts = 20)

# compute the median estimated effective length for each transcript:
eff_len = eff_len_compute(x_eff_len = txi$length)

# specify the path to the equivalence classes:
equiv_classes_files = file.path(data_dir, "STAR-salmon", sample_names, "aux_info", "eq_classes.txt")

# create data from 'salmon' and internally filter out lowly abundant transcripts:
input_data = create_data(salmon_or_kallisto = "salmon",
                         gene_to_transcript = gene_tr_id,
                         salmon_path_to_eq_classes = equiv_classes_files,
                         eff_len = eff_len, 
                         n_cores = 2,
                         transcripts_to_keep = transcripts_to_keep)
input_data

# create data from 'kallisto' and internally filter out lowly abundant transcripts:
kallisto_equiv_classes = file.path(data_dir, "kallisto", sample_names, "pseudoalignments.ec")
kallisto_equiv_counts  = file.path(data_dir, "kallisto", sample_names, "pseudoalignments.tsv")

input_data_2 = create_data(salmon_or_kallisto = "kallisto",
                          gene_to_transcript = gene_tr_id,
                          kallisto_equiv_classes = kallisto_equiv_classes,
                          kallisto_equiv_counts = kallisto_equiv_counts,
                          kallisto_counts = counts,
                          eff_len = eff_len, n_cores = 2,
                          transcripts_to_keep = transcripts_to_keep)
input_data_2


SimoneTiberi/BANDITS documentation built on Nov. 15, 2023, 2:35 p.m.