An introduction to the isatabr package"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.dim = c(7, 4)
)
library(isatabr)
op <- options(width = 100)

The isatabr package {.unnumbered}

The isatabr package is developed as a easy-to-use package for reading, modifying and writing files in the Investigation/Study/Assay (ISA) Abstract Model of the metadata framework using the ISA tab-delimited (TAB) format.

ISA is a metadata framework to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies. Built around the Investigation (the project context), Study (a unit of research) and Assay (analytical measurements) concepts, ISA helps you to provide rich descriptions of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable.

The ISA tab structure

The ISA-Tab structure is described in full detail on the ISA-tab website{.uri}. The description below is mostly taken from there and slightly condensed when appropriate.

ISA-Tab uses three types of file to capture the experimental metadata:

The Investigation file contains all the information needed to understand the overall goals and means used in an experiment; experimental steps (or sequences of events) are described in the Study and in the Assay file(s). For each Investigation file there may be one or more Studies defined with a corresponding Study file; for each Study there may be one or more Assays defined with corresponding Assay files.

In order to facilitate identification of ISA-Tab component files, specific naming patterns should be followed:

The Investigation file

The Investigation file fulfills four needs:

  1. to declare key entities, such as factors, protocols, which may be referenced in the other files;
  2. to track provenance of the terminologies (controlled vocabularies or ontologies) there are used, where applicable;
  3. to relate each Study file to an Investigation (this only becomes necessary when two or more Study files need to be grouped);
  4. to relate Assay files to Studies.

An Investigation file is structured as a table with vertical headings along the first column, and corresponding values in the subsequent columns. The following section headings must appear in the Investigation file (in order), and the study block (headings from STUDY to STUDY CONTACTS) can be repeated, one block per study associated with the investigation.

For a full description of all sections see the aforementioned site.

The Study file

The Study file contains contextualizing information for one or more assays, for example; the subjects studied; their source(s); the sampling methodology; their characteristics; and any treatments or manipulations performed to prepare the specimens.

The Assay file

The Assay file represents a portion of the experimental graph (i.e., one part of the overall structure of the workflow); each Assay file must contain assays of the same type, defined by the type of measurement (e.g. gene expression) and the technology employed (e.g. DNA microarray). Assay-related information includes protocols, additional information relating to the execution of those protocols and references to data files (whether raw or derived).

Example data

As an example for working with the isatabr package we will use the data set that accompanies @Atwell2010. The associated files are included in the package.

Reading files in the ISA-Tab format

ISA-Tab files can be stored in two different ways, either as separate files in a directory, or as .zip file containing the files. The example data is included in both ways in the package. Both formats can be read into R using the readISATab function.

When reading ISA-Tab files from a directory, only the name of the directory, where the ISA-TAB files are located, needs to be specified.

## Read ISA-Tab files from directory.
isaObject1 <- readISATab(path = file.path(system.file("extdata/Atwell", package = "isatabr")))

When reading zipped files, both the directory, where the zip-file is located, and the name of the file need to be specified.

## Read ISA-Tab files from directory.
isaObject2 <- readISATab(path = file.path(system.file("extdata", package = "isatabr")),
                         zipfile = "Atwell.zip")

In both cases readISATab will automatically detect the Investigation, Study and Assay files assuming the naming conventions described in the previous section are followed. If this is not the case, the function will give an error indicating the problem. The imported ISA-Tab files are stored in an object of the S4 class ISA. Since the information is almost identical for reading files from a directory and zipped-files, the following sections will show the example for the files read from a directory only.

Accessing and updating ISA objects.

All information from the ISA-Tab files is stored within slots in the ISA object. The table below gives an overview of the different slots and a brief description of the information stored in the slot. For a more exhaustive description see help("ISA"). Note that an investigation may have multiple studies. Therefore, data concerning studies is stored in a list object, where one element in the list corresponds to one study. Likewise a study may consist of multiple assays and assay data is stored in a list object, where one element in the list corresponds to one assay.

| Slot | Type | Description | |-----------|-------------------------|----------------------------------------------------------| | path | character | path to the ISA-Tab files | | iFileName | character | name of the investigation file | | oSR | data.frame | ONTOLOGY SOURCE REFERENCE section of investigation file | | invest | data.frame | INVESTIGATION section of investigation file | | iPubs | data.frame | INVESTIGATION PUBLICATIONS section of investigation file | | iContacts | data.frame | INVESTIGATION CONTACTS section of investigation file | | study | list of data.frames | STUDY sections of investigation file | | sDD | list of data.frames | STUDY DESIGN DESCRIPTORS sections of investigation file | | sPubs | list of data.frames | STUDY PUBLICATIONS sections of investigation file | | sFacts | list of data.frames | STUDY FACTORS sections of investigation file | | sAssays | list of data.frames | STUDY ASSAYS sections of investigation file | | sProts | list of data.frames | STUDY PROTOCOLS sections of investigation file | | sContacts | list of data.frames | STUDY CONTACTS sections of investigation file | | sFiles | list of data.frames | content of study files | | aFiles | list of data.frames | content of assay files |

All slots have corresponding functions for accessing and modifying information. The names of these access functions are the same as the slots they refer to, e.g. accessing the iFileName slot in an ISA object can be done using the iFileName() function. There is one exception to this. To prevent problems with the path() function, that already exists in quite some other packages, the path slot in an ISA object should be accessed using the isaPath() function.

## Access path for isaObjects
isaPath(isaObject1)
isaPath(isaObject2)

The path for isaObject1 shows the directory from which the files were read. As isaObject2 was read directly for a zipped archive, the files were first extracted into a temporary folder and subsequently read from there. This temporary folder is shown as the path.

The other slots are accessible in a similar way. Some more examples are shown below.

## Access studies.
isaStudies <- study(isaObject1)

## Print study names.
names(isaStudies)

## Access study descriptors.
isaSDD <- sDD(isaObject1)

## Shows study descriptors for study GMI_Atwell_study.
isaSDD$GMI_Atwell_study

It is not only possible to access the different slots in an ISA object, the slots can also be updated. As the access function, the update functions have the same name as the slots they refer to. As an example, let's assume an error sneaked into the ONTOLOGY SOURCE REFERENCE section and we want to update one of the source versions.

First have a look at the current content of the ONTOLOGY SOURCE REFERENCE section.

(isaOSR <- oSR(isaObject1))

Now we update the version of the OBI ontology source from 23 to 24. Then we update the modified ontology source data.frame in the ISA object.

## Update version number.
isaOSR[1, "Term Source Version"] <- 24

## Update oSR in ISA object.
oSR(isaObject1) <- isaOSR

## Check the updated oSR.
oSR(isaObject1)

In a similar way all slots in an ISA object can be accessed and updated.

Processing assay files

The assay files may contain information about the files used to store the actual data for the assay. Per assay file two types of data files may be referred to: 1) the file(s) containing the raw data, and 2) the file(s) containing derived data.

Looking at the assay tab file in our example data, we see that the Raw Data File column is empty, no raw data files are available. However, the Derived Data File shows the file d_data.txt.

## Inspect assay tab.
isaAFile <- aFiles(isaObject1)
head(isaAFile$a_study1.txt)

To read the contents of the data files, either raw or derived, in the assay tab file, we can use the processAssay() function. The exact working of this function depends on the technology type of the assay. For most technology types the data files are read as plain .txt files assuming a tab-delimited format. Only for mass spectrometry and microarray data the files are read differently (see the sections below). As the output above shows, the assay file in the example has a Data Transformation technology and is therefore read as tab-delimited file.

Before being able to process the assay file, i.e. read the data, we first have to extract the assay tabs using the getAssayTabs() function. This function extracts all the assay files from an ISA object and stores them as assayTab objects. These assayTab objects contain not only the content of the assay tab file, but also extra information, e.g. technology type.

## Get assay tabs for isaObject1.
aTabObjects <- getAssayTabs(isaObject1)

## Process assay data.
isaDat <- processAssay(isaObject = isaObject1,
                       aTabObject = aTabObjects$s_study1.txt$a_study1.txt,
                       type = "derived")

## Display first rows and columns.
head(isaDat[, 1:10])

The data is now stored in isaDat and can be used for further analysis within R.

Mass spectrometry assay files

Mass spectrometry data is often stored in Network Common Data Form (NetCDF) files, i.e. in .CDF files. Assay data containing these data will be processed in a different way than regular assay data. To be able to do this the xcms package is required. This package is available from Bioconductor.

As an example for the processing of mass spectrometry files we will use a subset of the quantitated LC/MS peaks from the spinal cords of 6 wild-type and 6 fatty acid amide hydrolase (FAAH) knockout mice described in @Saghatelian2004. A more extensive version of this data set is available in the faahKO{.uri} data package on Bioconductor.

## Read ISA-Tab files for faahKO.
isaObject3 <- readISATab(path = file.path(system.file("extdata/faahKO", package = "isatabr")))

After reading the ISA-Tab files, we can now process the mass spectrometry assay data. In this example the raw data is available, so when processing the assay we specify type = "raw". The rest of the code is similar to the previous section.

## Get assay tabs for isaObject3.
aTabObjects3 <- getAssayTabs(isaObject3)

## Process assay data.
isaDat3 <- processAssay(isaObject = isaObject3,
                        aTabObject = aTabObjects3$s_Proteomic_profiling_of_yeast.txt$a_metabolite.txt,
                        type = "raw")

## Display output.
isaDat3

As the output shows, processing the mass spectrometry data gives an object of class xcmsSet from the xcms package. This object contains all available information from the .CDF file that was read and can be used for further analysis.

Microarray assay files

Microarray data is often stored in an Affymetrix Probe Results file. These .CEL files contain information on the probe set's intensity values, and a probe set represents a gene. Assay data containing these data will be processed in a different way than regular assay data. To be able to do this the affy package is required. This package is available from Bioconductor.

Processing microarray data is done in a very similar way as processing mass spectrometry data, as described in the previous section. The main difference is that the resulting object will in this case be an object object of class ExpressionSet, which is used as input in many Bioconductor packages.

Writing files in the ISA-Tab format.

After updating an ISA object, it can be written back to a directory using the writeISAtab() function. All content of the ISA object will be written to investigation, study and assay files following the ISA-Tab standard for file specification. By default the files are written to the current working directory, but the directory can be specified using the path argument.

## Write content of ISA object to a temporary directory.
writeISAtab(isaObject = isaObject1, 
            path = tempdir())

Note that existing files are always overwritten. Therefore, writing files to the same directory, from where the original files were read, will result in the original files being overwritten.

Besides writing the full ISA object it is also possible to write only the investigation file , one or more study files or one or more assay files.

## Write investigation file.
writeInvestigationFile(isaObject = isaObject1,
                       path = tempdir())

## Write study file.
writeStudyFiles(isaObject = isaObject1,
                studyFilenames = "s_study1.txt",
                path = tempdir())

## Write assay file.
writeAssayFiles(isaObject = isaObject1,
                assayFilenames = "a_study1.txt",
                path = tempdir())
options(op)

References



Try the isatabr package in your browser

Any scripts or data that you put into this service are public.

isatabr documentation built on Aug. 19, 2022, 5:17 p.m.