SIAMCAT input files formats

Introduction

This vignette illustrates how to read and input your own data to the SIAMCAT package. We will cover reading in text files from the disk, formatting them and using them to create an object of siamcat-class.

The siamcat-class is the centerpiece of the package. All of the input data and result are stored inside of it. The structure of the object is described below in the siamcat-class object section.

Loading your data into R

SIAMCAT input

Generally, there are three types of input for SIAMCAT:

Features

The features should be a matrix, a data.frame, or an otu_table, organized as follows:

features (in rows) x samples (in columns).

| | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | | --- | ---:| ---:| ---:| ---:| ---:| | Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 | | Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | | Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.20 | | Feature_4 | 0.34 | 0.00 | 0.13 | 0.07 | 0.00 | | Feature_5 | 0.06 | 0.16 | 0.00 | 0.00 | 0.00 |

Please note that SIAMCAT is supposed to work with relative abundances. Other types of data (e.g. counts) will also work, but not all functions of the package will result in meaningful outputs.

An example of a typical feature file is attached to the SIAMCAT package, containing data from a publication investigating the microbiome in colorectal cancer (CRC) patients and controls (the study can be found here: Zeller et al). The metagenomics data were processed with the MOCAT pipeline, returning taxonomic profiles on the species levels (specI):

library(SIAMCAT)
fn.in.feat  <- system.file(
    "extdata",
    "feat_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)

One way to load such data into R could be the use of read.table

(Beware of the defaults in R! They are not always useful...)

feat <- read.table(fn.in.feat, sep='\t',
    header=TRUE, quote='',
    stringsAsFactors = FALSE, check.names = FALSE)
# look at some features
feat[110:114, 1:2]

Metadata

The metadata should be either a matrix or a data.frame.

samples (in rows) x metadata (in columns):

| | Age | Gender | BMI | | --- | ---:| ---:| ---:| | Sample_1 | 52 | 1 | 20| | Sample_2 | 37 | 1 | 18 | | Sample_3 | 66 | 2 | 24 | | Sample_4 | 54 | 2 | 26 | | Sample_5 | 65 | 2 | 30 |

The rownames of the metadata should match the colnames of the feature matrix.

Again, an example of such a file is attached to the SIAMCAT package, taken from the same study:

fn.in.meta  <- system.file(
    "extdata",
    "num_metadata_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)

Also here, the read.table can be used to load the data into R.

meta <- read.table(fn.in.meta, sep='\t',
    header=TRUE, quote='',
    stringsAsFactors = FALSE, check.names = FALSE)
head(meta)

Label

Finally, the label can come in different three different flavours:

For our example dataset, we can create the label object out of the metadata column called diagnosis:

label <- create.label(meta=meta, label="diagnosis",
    case = 1, control=0)

When we later plot the results, it might be nicer to have names for the different groups stored in the label object (instead of 1 and 0). We can also supply them to the create.label function:

label <- create.label(meta=meta, label="diagnosis",
    case = 1, control=0,
    p.lab = 'cancer', n.lab = 'healthy')
label$info

Note:
If you have no label information for your dataset, you can still create a SIAMCAT object from your features alone. The SIAMCAT object without label information will contain a TEST label that can be used for making holdout predictions. Other functions, e.g. model training, will not work on such an object.

LEfSe format files

LEfSe is a tool for identification of associations between micriobial features and up to two metadata. LEfSe uses LDA (linear discriminant analysis).

LEfSe input file is a .tsv file. The first few rows contain the metadata. The following row contains sample names and the rest of the rows are occupied by features. The first column holds the row names:

| label | healthy | healthy | healthy | cancer | cancer | | --- | ---:| ---:| ---:| ---:| ---:| | age | 52 | 37 | 66 | 54 | 65 | | gender | 1 | 1 | 2 | 2 | 2 | |Sample_info | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | | Feature_1 | 0.59 | 0.71 | 0.78 | 0.61 | 0.66 | | Feature_2 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | | Feature_3 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | | Feature_4 | 0.34 | 0.00 | 0.43 | 0.00 | 0.00 | | Feature_5 | 0.56 | 0.56 | 0.00 | 0.00 | 0.00 |

An example of such a file is attached to the SIAMCAT package:

fn.in.lefse<- system.file(
    "extdata",
    "LEfSe_crc_zeller_msb_mocat_specI.tsv",
    package = "SIAMCAT"
)

SIAMCAT has a dedicated function to read LEfSe format files. The read.lefse function will read in the input file and extract metadata and features:

meta.and.features <- read.lefse(fn.in.lefse,
    rows.meta = 1:6, row.samples = 7)
meta <- meta.and.features$meta
feat <- meta.and.features$feat

We can then create a label object from one of the columns of the meta object and create a siamcat object:

label <- create.label(meta=meta, label="label", case = "cancer")

metagenomeSeq format files

metagenomeSeq is an R package to determine differentially abundant features between multiple samples.

There are two ways to input data into metagenomeSeq:

a) two files, one for metadata and one for features - those can be used in SIAMCAT just like described in SIAMCAT input with read.table:

fn.in.feat  <- system.file(
    "extdata",
    "CHK_NAME.otus.count.csv",
    package = "metagenomeSeq"
)
feat <- read.table(fn.in.feat, sep='\t',
    header=TRUE, quote='', row.names = 1,
    stringsAsFactors = FALSE, check.names = FALSE
)

b) BIOM format file, that can be used in SIAMCAT as described in the following section

BIOM format files

The BIOM format files can be added to SIAMCAT via phyloseq. First the file should be imported using the phyloseq function import_biom. Then a phyloseq object can be imported as a siamcat object as descibed in the next section.

Creating a siamcat object of a phyloseq object

The siamcat object extends on the phyloseq object. Therefore, creating a siamcat object from a phyloseq object is really straightforward. This can be done with the siamcat constructor function. First, however, we need to create a label object:

data("GlobalPatterns") ## phyloseq example data
label <- create.label(meta=sample_data(GlobalPatterns),
    label = "SampleType",
    case = c("Freshwater", "Freshwater (creek)", "Ocean"))
# run the constructor function
siamcat <- siamcat(phyloseq=GlobalPatterns, label=label)

Creating a siamcat-class object

The siamcat-class is the centerpiece of the package. All of the is stored inside of the object: internal make-up of a siamcat object

In the figure above, rectangles depict slots of the object and the class of the object stored in the slot is given in the ovals. There are two obligatory slots -phyloseq (containing the metadata as sample_data and the original features as otu_table) and label - marked with thick borders.

The siamcat object is constructed using the siamcat() function. There are two ways to initialize it:

Please note that you have to provide either feat or phyloseq and that you cannot provide both.

In order to explain the siamcat object better we will show how each of the slots is filled.

phyloseq, label and orig_feat slots

The phyloseq and label slots are obligatory.

The phyloseq, label and orig_feat are filled when the siamcat object is first created with the constructor function. construction

All the other slots

Other slots are filled during the run of the SIAMCAT workflow: workflow

Accessing and assigning slots

Each slot in siamcat can be accessed by typing

slot_name(siamcat)

e.g. for the eval_data slot you can types

eval_data(siamcat)

There is one notable exception: the phyloseq slot has to be accessed with physeq(siamcat) due to technical reasons.

Slots will be filled during the SIAMCAT workflow by the package's functions. However, if for any reason a slot needs to be assigned outside of the workflow, the following formula can be used:

slot_name(siamcat) <- object_to_assign

e.g. to assign a new_label object to the label slot:

label(siamcat) <- new_label

Please note that this may lead to unforeseen consequences...

Slots inside the slots

There are two slots that have slots inside of them. First, the model_list slot has a models slot that contains the actual list of mlr models -can be accessed via models(siamcat)- and model.type which is a character with the name of the method used to train the model: model_type(siamcat).

The phyloseq slot has a complex structure. However, unless the phyloseq object is created outside of the SIAMCAT workflow, only two slots of phyloseq slot will be occupied: the otu_table slot containing the features table and the sam_data slot containing metadata information. Both can be accessed by typing either features(siamcat) or meta(siamcat).

Additional slots inside the phyloseq slots do not have dedicated accessors, but can easily be reached once the phyloseq object is exported from the siamcat object:

phyloseq <- physeq(siamcat)
tax_tab <- tax_table(phyloseq)
head(tax_tab)

If you want to find out more about the phyloseq data structure, head over to the phyloseq BioConductor page.

Session Info

sessionInfo()


Try the SIAMCAT package in your browser

Any scripts or data that you put into this service are public.

SIAMCAT documentation built on Nov. 8, 2020, 5:14 p.m.