loadSQM: Load a SqueezeMeta project into R

Description

This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py.

Usage

loadSQM(
  project_path,
  tax_mode = "prokfilter",
  trusted_functions_only = FALSE,
  single_copy_genes = "MGOGs",
  load_sequences = TRUE,
  engine = "data.table"
)

Arguments

project_path

character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by sqm2zip.py.

tax_mode

character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use allfilter for applying the minimum identity threshold to all taxa, prokfilter for applying the threshold to Bacteria and Archaea, but not to Eukaryotes, and nofilter for applying no thresholds at all (default prokfilter).

trusted_functions_only

logical. If TRUE, only highly trusted functional annotations (best hit + best average) will be considered when generating aggregated function tables. If FALSE, best hit annotations will be used (default FALSE). Will only have an effect if project_path is not a zip file, and project_path/results/tables is not already present.

single_copy_genes

character, source of single copy genes for copy number normalization. Either RecA (COG0468, RecA/RadA), MGOGs (COGs for 10 single copy and housekeeping genes, Salazar, G et al., 2019), MGKOs (KOs for 10 single copy and housekeeping genes, Salazar, G et al., 2019) or USiCGs (KOs for 15 single copy genes, Carr et al., 2013, Table S1). For MGOGs, MGKOs and USiCGs, the median coverage of the set of single copy genes will be used for normalization. Default MGOGs.

load_sequences

logical. If TRUE, contig and ORF sequences will be loaded into the SQM object. Setting it to FALSE will reduce memory usage. Default TRUE.

engine

character. Engine used to load the ORFs and contigs tables. Either data.frame or data.table (significantly faster if your project is large). Default data.table.
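The copy number normalization controlled by the single_copy_genes argument can be pictured with a small base R sketch. The coverage values and the set scg_ids below are invented for illustration, and the code only mirrors the idea described above (dividing each function's coverage by the per-sample median coverage of a set of single copy genes). It is not taken from the SQMtools source.

```r
# Toy coverage matrix: functions (rows) x samples (columns). Values are invented.
cov <- matrix(
  c(10, 20,
    12, 18,
     8, 22,
    30, 80),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("COG_A", "COG_B", "COG_C", "COG_X"), c("S1", "S2"))
)

# Hypothetical set of single copy genes (stands in for a set such as MGOGs).
scg_ids <- c("COG_A", "COG_B", "COG_C")

# Median coverage of the single copy genes in each sample.
scg_cov <- apply(cov[scg_ids, , drop = FALSE], 2, median)

# Divide every column (sample) by its single copy gene coverage.
copy_number <- sweep(cov, 2, scg_cov, "/")
print(copy_number)
```

Under this reading, a value near 1 means the function is present roughly once per genome in that sample.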

Value

SQM object containing the parsed project. If more than one path is provided in project_path, this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite) but with an extra entry named projects that contains one SQM object per input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.
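Because the return type depends on how many paths are given, downstream code may want to handle both cases uniformly. The helper below is only a sketch: sqm_projects is a hypothetical name, the mock objects imitate just the fields mentioned above, and the class names "SQM" and "SQMbunch" are assumed to be the S3 classes carried by the real objects.

```r
# Return a named list with one SQM-like object per project,
# whether the input is a single project or a bunch of projects.
sqm_projects <- function(x) {
  if (inherits(x, "SQMbunch")) {
    return(x$projects)
  }
  stats::setNames(list(x), x$misc$project_name)
}

# Minimal mock objects (NOT real loadSQM output).
mock_a <- structure(list(misc = list(project_name = "projA")), class = "SQM")
mock_b <- structure(list(misc = list(project_name = "projB")), class = "SQM")
mock_bunch <- structure(
  list(projects = list(projA = mock_a, projB = mock_b)),
  class = "SQMbunch"
)

print(names(sqm_projects(mock_a)))
print(names(sqm_projects(mock_bunch)))
```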

Prerequisites

Run SqueezeMeta! An example call for running it would be:

/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -m coassembly -f fastq_dir -s samples_file -p project_dir

The SQM object structure

The SQM object is a nested list which contains the following information:

lvl1          lvl2                lvl3           type              rows/names     columns       data
$orfs         $table                             dataframe         orfs           misc. data    misc. data
              $abund                             numeric matrix    orfs           samples       abundances (reads)
              $bases                             numeric matrix    orfs           samples       abundances (bases)
              $cov                               numeric matrix    orfs           samples       coverages
              $cpm                               numeric matrix    orfs           samples       covs. / 10^6 reads
              $tpm                               numeric matrix    orfs           samples       tpm
              $seqs                              character vector  orfs           (n/a)         sequences
              $tax                               character matrix  orfs           tax. ranks    taxonomy
              $tax16S                            character vector  orfs           (n/a)         16S rRNA taxonomy
              $markers                           list              orfs           (n/a)         CheckM1 markers
$contigs      $table                             dataframe         contigs        misc. data    misc. data
              $abund                             numeric matrix    contigs        samples       abundances (reads)
              $bases                             numeric matrix    contigs        samples       abundances (bases)
              $cov                               numeric matrix    contigs        samples       coverages
              $cpm                               numeric matrix    contigs        samples       covs. / 10^6 reads
              $tpm                               numeric matrix    contigs        samples       tpm
              $seqs                              character vector  contigs        (n/a)         sequences
              $tax                               character matrix  contigs        tax. ranks    taxonomies
              $bins                              character matrix  contigs        bin. methods  bins
$bins         $table                             dataframe         bins           misc. data    misc. data
              $length                            numeric vector    bins           (n/a)         length
              $abund                             numeric matrix    bins           samples       abundances (reads)
              $percent                           numeric matrix    bins           samples       percentages
              $bases                             numeric matrix    bins           samples       abundances (bases)
              $cov                               numeric matrix    bins           samples       coverages
              $cpm                               numeric matrix    bins           samples       covs. / 10^6 reads
              $tax                               character matrix  bins           tax. ranks    taxonomy
$taxa         $superkingdom       $abund         numeric matrix    superkingdoms  samples       abundances (reads)
                                  $percent       numeric matrix    superkingdoms  samples       percentages
              $phylum             $abund         numeric matrix    phyla          samples       abundances (reads)
                                  $percent       numeric matrix    phyla          samples       percentages
              $class              $abund         numeric matrix    classes        samples       abundances (reads)
                                  $percent       numeric matrix    classes        samples       percentages
              $order              $abund         numeric matrix    orders         samples       abundances (reads)
                                  $percent       numeric matrix    orders         samples       percentages
              $family             $abund         numeric matrix    families       samples       abundances (reads)
                                  $percent       numeric matrix    families       samples       percentages
              $genus              $abund         numeric matrix    genera         samples       abundances (reads)
                                  $percent       numeric matrix    genera         samples       percentages
              $species            $abund         numeric matrix    species        samples       abundances (reads)
                                  $percent       numeric matrix    species        samples       percentages
$functions    $KEGG               $abund         numeric matrix    KEGG ids       samples       abundances (reads)
                                  $bases         numeric matrix    KEGG ids       samples       abundances (bases)
                                  $cov           numeric matrix    KEGG ids       samples       coverages
                                  $cpm           numeric matrix    KEGG ids       samples       covs. / 10^6 reads
                                  $tpm           numeric matrix    KEGG ids       samples       tpm
                                  $copy_number   numeric matrix    KEGG ids       samples       avg. copies
              $COG                $abund         numeric matrix    COG ids        samples       abundances (reads)
                                  $bases         numeric matrix    COG ids        samples       abundances (bases)
                                  $cov           numeric matrix    COG ids        samples       coverages
                                  $cpm           numeric matrix    COG ids        samples       covs. / 10^6 reads
                                  $tpm           numeric matrix    COG ids        samples       tpm
                                  $copy_number   numeric matrix    COG ids        samples       avg. copies
              $PFAM               $abund         numeric matrix    PFAM ids       samples       abundances (reads)
                                  $bases         numeric matrix    PFAM ids       samples       abundances (bases)
                                  $cov           numeric matrix    PFAM ids       samples       coverages
                                  $cpm           numeric matrix    PFAM ids       samples       covs. / 10^6 reads
                                  $tpm           numeric matrix    PFAM ids       samples       tpm
                                  $copy_number   numeric matrix    PFAM ids       samples       avg. copies
$total_reads                                     numeric vector    samples        (n/a)         total reads
$misc         $project_name                      character vector  (empty)        (n/a)         project name
              $samples                           character vector  (empty)        (n/a)         samples
              $tax_names_long     $superkingdom  character vector  short names    (n/a)         full names
                                  $phylum        character vector  short names    (n/a)         full names
                                  $class         character vector  short names    (n/a)         full names
                                  $order         character vector  short names    (n/a)         full names
                                  $family        character vector  short names    (n/a)         full names
                                  $genus         character vector  short names    (n/a)         full names
                                  $species       character vector  short names    (n/a)         full names
              $tax_names_short                   character vector  full names     (n/a)         short names
              $KEGG_names                        character vector  KEGG ids       (n/a)         KEGG names
              $KEGG_paths                        character vector  KEGG ids       (n/a)         KEGG hierarchy
              $COG_names                         character vector  COG ids        (n/a)         COG names
              $COG_paths                         character vector  COG ids        (n/a)         COG hierarchy
              $ext_annot_sources                 character vector  COG ids        (n/a)         external databases
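As a quick illustration of how these entries relate to each other, the sketch below builds a tiny mock list shaped like the $orfs branch and derives a coverage-per-million-reads matrix from $cov and $total_reads. The numbers are invented, and the formula is one reading of the "covs. / 10^6 reads" description in the table, so treat it as an assumption and not as the package's exact computation.

```r
# Mock object with the same nesting as SQM$orfs$cov and SQM$total_reads.
mock <- list(
  orfs = list(
    cov = matrix(
      c(2, 6,
        4, 8),
      ncol = 2, byrow = TRUE,
      dimnames = list(c("orf_1", "orf_2"), c("S1", "S2"))
    )
  ),
  total_reads = c(S1 = 2e6, S2 = 4e6)
)

# Millions of reads per sample, ordered like the columns of the matrix.
million_reads <- mock$total_reads[colnames(mock$orfs$cov)] / 1e6

# Coverage divided by millions of reads, sample by sample.
cpm <- sweep(mock$orfs$cov, 2, million_reads, "/")
print(cpm)
```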

If external databases for functional classification were provided to SqueezeMeta via the -extdb argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions (e.g. results for the CAZy database would be present in SQM$functions$CAZy). Additionally, the extended names of the features present in the external database will be present in SQM$misc (e.g. SQM$misc$CAZy_names).
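The naming convention described above lends itself to generic lookups. A minimal sketch with a mock object follows. Here feature_names is a hypothetical helper and the mock CAZy entries are placeholders; only the SQM$functions$<database> and SQM$misc$<database>_names pattern is taken from the text.

```r
# Look up the extended names of features from any functional database
# stored in an SQM-like nested list.
feature_names <- function(sqm, db, ids) {
  stopifnot(db %in% names(sqm$functions))
  sqm$misc[[paste0(db, "_names")]][ids]
}

# Mock object (NOT real loadSQM output) with a single external database.
mock <- list(
  functions = list(
    CAZy = list(
      abund = matrix(c(5, 7), nrow = 2,
                     dimnames = list(c("GH13", "GT2"), "S1"))
    )
  ),
  misc = list(
    CAZy_names = c(GH13 = "Glycoside hydrolase family 13",
                   GT2  = "Glycosyltransferase family 2")
  )
)

print(names(mock$functions))               # databases available in the mock
print(feature_names(mock, "CAZy", "GH13"))
```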

Examples

## Not run: 
## (outside R)
## Run SqueezeMeta on the test data.
 /path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.

## End(Not run)

data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
# Take 11 entries, since "Unclassified" is expected among them and is dropped next.
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
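Since the boxplot above treats every contig equally, a length-weighted summary may be closer to the GC content of the assembled sequence. The sketch below uses a mock table with invented values. The "Length" column name is an assumption about the contigs table, so check colnames(Hadza$contigs$table) before adapting it to real data.

```r
# Mock contigs table (NOT real data); check.names = FALSE keeps "GC perc" intact.
contigs <- data.frame(
  "GC perc" = c(40, 60, 50),
  "Length"  = c(1000, 3000, 1000),
  check.names = FALSE
)

# Plain mean: every contig counts the same.
unweighted <- mean(contigs[, "GC perc"])

# Length-weighted mean: long contigs contribute proportionally more.
weighted <- weighted.mean(contigs[, "GC perc"], w = contigs[, "Length"])

print(c(unweighted = unweighted, weighted = weighted))
```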

SQMtools documentation built on April 3, 2025, 6:16 p.m.