README.md
In RGLab/HIPCMatrix: Microarray and RNA-seq Processing for ImmuneSpace

HIPCMatrix

Utilities for processing and analyzing HIPC gene expression data in ImmuneSpace.

Processing Gene Expression Data

analysis_dir
├── analysis
│   └── exprs_matrices
│       ├── matrix_name.tsv
│       ├── matrix_name.tsv.raw
│       ├── matrix_name.tsv.summary
│       └── matrix_name.tsv.summary.orig
└── rawdata
    └── gene_expression
        ├── create-matrix
        │   └── matrix_name
        │       └── create-matrix-vars.tsv
        ├── RAW_FILES_FROM_IMMPORT
        └── supp_files
            └── sampletype_armaccession
                └── SDY1630_raw_expression.txt

Raw files from ImmPort get saved into analysis_dir/rawdata
Raw downloads from GEO or intermediate files from ImmPort are saved into analysis_dir/rawdata/supp_files/sampletype_armaccession.
Debug files get saved into create-matrix/matrix_name

prep files
1. pull files from correct location (GEO or supplementary files from ImmPort)
2. If needed, clean up to be consistent per platform
  1. illumina: probe intensity and detection p-values for all samples saved to one tsv file
  2. affymetrix: CEL files
  3. Stanford Genomics (two-color array): background-corrected expression saved to one tsv file
  4. RNA-seq: raw counts saved to one tsv file
3. Return path to cleaned file
prepare "raw" matrix: background-corrected expression for microarray; raw counts for rna-seq. Samples labeled using ImmPort biosample accession.
1. Load background-corrected or raw counts matrix from input file returned from previous step
  1. illumina: Derive background-corrected expression using limma
  2. affymetrix: Derive background-corrected expression using RMA (Result will be in log2 scale)
  3. two-color-array: Background correction already performed during prep step, so only need to read input file
  4. RNA-seq: No transformation needed. Read raw counts into one table
2. Map sample id to ImmPort biosample accession and subset to only include selected biosamples.
3. Return data.table with one row for feature, one column per biosample accession.
Normalize matrix: Normalize across samples to remove batch effects. Normalized matrix will be in log2 scale.
1. microarray: normalize across samples using quantile normalization. Illumina data must first be log2 transformed. Affy and two color array data already should be in log2 space based on raw matrix preparation
2. rna-seq: DESeq2
Summarize matrix by gene symbol. This is the same process for all platforms
Write raw, normalized, and summarized matrices to disk as tsv
Write out log and debug files

Gene Expression Analysis

Methods for downstream analysis used in ImmuneSpace are also maintained in this package. They are accessible through the HMX object, which is an R6 object that extends the ISConn object defined in ImmuneSpaceR.

To create an HMX object:

con <- HMX$new("SDY269")

You can then perform analyses on the matrices available through the connection object. For example, to run differential expression analysis:

de_results <- con$run_de_analysis()

The Immune Response Predictors method finds genes which predict immune response, with methods to fit a model to those features and compare observed vs predicted results. These methods are used in the Immune Response Predictor module in ImmuneSpace.

Predictor genes are identified using elastic net regression using the glmnet package.

The train_immune_response_predictors method will identify genes which predict immune response, measured either by HAI, Neutralizing Antibody Titer, or ELISA results. For example, to find genes whose change in expression from baseline to day 7 are associated with HAI response, and fit a model:

con <- HMX$new("SDY269")
fit <- con$train_immune_response_predictors(
    cohorts = "LAIV group 2008_PBMC",
    timepoint = 7
  )

Under the hood, this function: 1. Identifies genes differentially expressed from baseline to the selected timepoint. 1. Calculates fold-change from baseline to selected timepoint for those genes. 1. Calculates immune response using the selected assay (default is HAI) for all participants in the selected cohort. 1. Selects features whose fold-change value is predictive of HAI response using elastic net regression. 1. Fits a linear model to the data using the selected features.

You can also return just the selected features instead of the whole model with the return_type option:

features <- con$train_immune_response_predictors(
    cohorts = "LAIV group 2008_PBMC",
    timepoint = 7, 
    return_type = "features"
  )

train_immune_response_predictors also has the option to dichotomize response values using a threshold, using the dichotomize and dichotomize_thresh options. In this case, features will be selected and the model will be fit using logistic regression instead of linear.

Once a model has been fit, fitted values can be derived using the predict_response method, which returns predicted values for each participant in the selected cohort.

Feature annotation

This package provides tools for ensuring that the Gene Symbols used throughout the ImmuneSpace portal are consistent and up to date. The UpdateAnno package previously managed gene annotation in ImmuneSpace. There are a number of places where Gene Symbols play an important role: - Gene Expression Explorer - Pulls Gene Symbols from the FeatureAnnotation query - Gene Set Enrichment Analysis - Pulls Gene Symbols from the static gene set module data objects in this package, e.g. chaussabel or emory. - Differential Gene Expression Analysis - Uses the GeneExpressionAnalysisResults query to display differentially expressed genes.

When loading a new matrix onto ImmuneSpace, it must be associated with a feature annotation set, saved in the Feature Annotation Set table. If the annotation for the matrix is not already loaded from a previous study which used the same platform, you must create a new one. Each feature annotation table and the source code to create it is saved in HIPCMatrix/inst/FeatureAnnotationSetDev.

Please refer to the Notion Documentation for more details on how to create and upload a FAS.

The function create_gene_alias_map() pulls the most recent annotation from the latest HUGO Gene Nomenclature Committee dataset. This resource is used instead of the NCBI database via biomaRt or the equivalent org.Hs.eg.db as those data sources contained mappings that were deemed incorrect (such as "ACTB" > "POTEF") during the ImmuneSignatures 2 project.

This function handles 2 edge cases: 1. An alias maps to itself as a symbol as well as other symbols. In this case, we have selected to use the self-mapping and removed any other mappings. Our rationale is that the other-mapping symbols are historic artifacts and no longer accurate. 2. An alias maps to multiple symbols that do not include itself. In this case, we drop these aliases from the matrix because we do not have a good way to know which symbol is the most accurate.

This mapping table, as well as gene sets with the most current mapping, are saved in HIPCMatrix package data to be easily accessible from many locations.

The HMX object includes methods for updating annotation in ImmuneSpace, using the gene symbol mapping table installed with the package:

To update feature annotation sets associated with matrices associated with a connection:

con <- HMX$new("")
con$updateFAS()

To update summary.tsv expressionsets associated with a connection:

con$updateEMs()

To update these mappings and apply it throughout ImmuneSpace:

Source HIPCMatrix/data-raw/update_hgnc_mapping.R. This will update the HGNC mapping table which is part of the HIPCMatrix package data, then use this new mapping to update the gene signatures which are also part of the package data.
Bump package version
Commit the changes, and push to github
Install the update on the server: ssh rsT s R devtools::install("RGLab/HIPCMatrix@dev")
On the server, go to data integration module at Studies level:
- TEST: https://datatools-dev.immunespace.org/dataintegration/Studies/begin.view?
- PROD: https://datatools.immunespace.org/dataintegration/Studies/begin.view?
Run Update Anno ETL. This will:
1. Update Feature Annotation Sets with current annotation
2. Update .summary.tsv expressionsets with the current annotation
3. Re-run GSEA for all studies where it is turned on.