In jhrcook/HotNetvieweR: Read HotNet2 Results and Convert to Graph object

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(HotNetvieweR)

This package was made to bridge the gap between the raw output of HotNet2 and further visualization and analysis. This vignette describes how I ran the example provided by the Raphael group on the HotNet2 GitHub repository.

Setup

The first step is to clone the HotNet2 repository from GitHub and set-up the environment to use it in.

I began by making a specific directory for this example analysis.

mkdir HotNet2_example
cd HotNet2_example

I then cloned the GitHub repository.

git clone https://github.com/raphael-group/hotnet2.git

The HotNet2 tool was written in Python 2.7. Therefore, I setup a virtual environment, activated it, and then installed the necessary libraries (listed in the "requirements.txt" file).

# create virtual environment
virtualenv venv-hotnet2

# activate virtual environment
source venv-hotnet2/bin/activate

# install the libraries for HotNet2
cd hotnet2
pip install -r requirements.txt

To increase the speed of using HotNet2, the team implemented some of the algorithms (namely the edge-swapping algorithm) in Fortran and C. These two commands pre-compile the code. They produce a few warnings, but no errors.

python hotnet2/setup_fortran.py build_src build_ext --inplace
python hotnet2/setup_c.py build_src build_ext --inplace

Heat Files

The heat files contains the starting "heat" for each node. The input data needs to follow the correct format, so make sure to take a look at "hotnet2/example/example.heat" to see what it should look like.

The heat file is first transformed into a JSON in a specific format for HotNet2. This uses the makeHeatFile.py file; the help information is shown below.

python makeHeatFile.py --help

#> usage: makeHeatFile.py [-h] {scores,mutation,oncodrive,mutsig,music} ...
#> 
#> Generates a JSON heat file for input to runHotNet2.
#> 
#> optional arguments:
#>   -h, --help            show this help message and exit
#> 
#> Heat score type:
#>   {scores,mutation,oncodrive,mutsig,music}
#>     scores              Pre-computed heat scores
#>     mutation            Mutation data
#>     oncodrive           Oncodrive scores
#>     mutsig              MutSig scores
#>     music               MuSiC scores

The makeHeatFile.py script can take several types of input. Here, I will demonstrate two of them. The first is scores; below is the help information.

python makeHeatFile.py scores --help

#> usage: makeHeatFile.py scores [-h] [-o OUTPUT_FILE] [-n NAME] -hf HEAT_FILE
#>                               [-ms MIN_HEAT_SCORE] [-gff GENE_FILTER_FILE]
#> 
#> optional arguments:
#>   -h, --help            show this help message and exit
#>   -o OUTPUT_FILE, --output_file OUTPUT_FILE
#>                         Output file. If none given, output will be written to
#>                         stdout.
#>   -n NAME, --name NAME  Name/Label describing the heat scores.
#>   -hf HEAT_FILE, --heat_file HEAT_FILE
#>                         Path to a tab-separated file containing a gene name in
#>                         the first column and the heat score for that gene in
#>                         the second column of each line.
#>   -ms MIN_HEAT_SCORE, --min_heat_score MIN_HEAT_SCORE
#>                         Minimum heat score for genes to have their original
#>                         heat score in the resulting output file. Genes with
#>                         score below this value will be assigned score 0.
#>   -gff GENE_FILTER_FILE, --gene_filter_file GENE_FILTER_FILE
#>                         Path to file listing genes whose heat scores should be
#>                         preserved, one per line. If present, all other heat
#>                         scores will be discarded.

Using the above information, I wrote the following command to make the JSON heat file from "example/example.heat" and saved it to "example_run/heatfiles/scores_heatfile.json".

python makeHeatFile.py scores \
  --heat_file example/example.heat \
  --name scoresheat \
  --output_file example_run/heatfiles/scores_heatfile.json

#> * Loading heat scores for 25 genes

Here it was the resulting JSON looked like.

{
    "heat": {
        "gene8": 4.0,
        "gene9": 2.0,
        "gene1": 15.0,
        "gene2": 6.0,
        "gene3": 5.0,
        "gene4": 1.0,
        "gene5": 3.0,
        "gene6": 2.0,
        "gene7": 1.0,
        "gene12": 1.0,
        "gene13": 1.0,
        "gene10": 5.0,
        "gene11": 1.0,
        "gene16": 1.0,
        "gene17": 1.0,
        "gene14": 1.0,
        "gene15": 1.0,
        "gene18": 1.0,
        "gene19": 1.0,
        "gene23": 1.0,
        "gene22": 1.0,
        "gene21": 1.0,
        "gene20": 1.0,
        "gene25": 1.0,
        "gene24": 1.0
    },
    "parameters": {
        "name": "HN2example",
        "gene_filter_file": null,
        "heat_file": "example/example.heat",
        "output_file": "example/example_heatfile.json",
        "min_heat_score": 0,
        "heat_fn": "load_direct_heat"
    }
}

Below, I show the help information and the command I used to create a help file using CNA and SNV information.

python makeHeatFile.py mutation --help

#> usage: makeHeatFile.py mutation [-h] [-o OUTPUT_FILE] [-n NAME] --snv_file
#>                                 SNV_FILE [--cna_file CNA_FILE]
#>                                 [--sample_file SAMPLE_FILE]
#>                                 [--sample_type_file SAMPLE_TYPE_FILE]
#>                                 [--gene_file GENE_FILE] [--min_freq MIN_FREQ]
#>                                 [--cna_filter_threshold CNA_FILTER_THRESHOLD]
#> 
#> optional arguments:
#>   -h, --help            show this help message and exit
#>   -o OUTPUT_FILE, --output_file OUTPUT_FILE
#>                         Output file. If none given, output will be written to
#>                         stdout.
#>   -n NAME, --name NAME  Name/Label describing the heat scores.
#>   --snv_file SNV_FILE   Path to a tab-separated file containing SNVs where the
#>                         first column of each line is a sample ID and
#>                         subsequent columns contain the names of genes with
#>                         SNVs in that sample. Lines starting with "#" will be
#>                         ignored.
#>   --cna_file CNA_FILE   Path to a tab-separated file containing CNAs where the
#>                         first column of each line is a sample ID and
#>                         subsequent columns contain gene names followed by
#>                         "(A)" or "(D)" indicating an amplification or deletion
#>                         in that gene for the sample. Lines starting with "#"
#>                         will be ignored.
#>   --sample_file SAMPLE_FILE
#>                         File listing samples. Any SNVs or CNAs in samples not
#>                         listed in this file will be ignored. If HotNet is run
#>                         with mutation permutation testing, all samples in this
#>                         file will be eligible for random mutations even if the
#>                         sample did not have any mutations in the real data. If
#>                         not provided, the set of samples is assumed to be all
#>                         samples that are provided in the SNV or CNA data.
#>   --sample_type_file SAMPLE_TYPE_FILE
#>                         File listing type (e.g. cancer, datasets, etc.) of
#>                         samples (see --sample_file). Each line is a space-
#>                         separated row listing one sample and its type. The
#>                         sample types are used for creating the HotNet(2) web
#>                         output.
#>   --gene_file GENE_FILE
#>                         File listing tested genes. SNVs or CNAs in genes not
#>                         listed in this file will be ignored. If HotNet is run
#>                         with mutation permutation testing, every gene in this
#>                         file will be eligible for random mutations even if the
#>                         gene did not have mutations in any samples in the
#>                         original data. If not provided, the set of tested
#>                         genes is assumed to be all genes that have mutations
#>                         in either the SNV or CNA data.
#>   --min_freq MIN_FREQ   The minimum number of samples in which a gene must
#>                         have an SNV to be considered mutated in the heat score
#>                         calculation.
#>   --cna_filter_threshold CNA_FILTER_THRESHOLD
#>                         Proportion of CNAs in a gene across samples that must
#>                         share the same CNA type in order for the CNAs to be
#>                         included. This must either be > .5, or the default,
#>                         None, in which case all CNAs will be included.

python makeHeatFile.py mutation \
  --snv_file example/example.snv \
  --cna_file example/example.cna \
  --name mutationsheat \
  --output_file example_run/heatfiles/mutations_heatfile.json

#> * Calculating heat scores for 9 genes in 10 samples at min frequency 1

Network Files

The next step is to prepare the real and permuted network files using makeNetworkFiles.py. Note that this step can take a very long time on a personal computer and a real PPI. This example data only takes a few seconds to run, though.

Below I show the help information for the makeNetworkFiles.py script.

python makeNetworkFiles.py --help

#> usage: makeNetworkFiles.py [-h] -e EDGELIST_FILE -i GENE_INDEX_FILE -nn
#>                            NETWORK_NAME -p PREFIX [-is INDEX_FILE_START_INDEX]
#>                            -b BETA [-op] [-q Q] [-ps PERMUTATION_START_INDEX]
#>                            [-np NUM_PERMUTATIONS] -o OUTPUT_DIR [-c CORES]
#> 
#> Create the personalized pagerank matrix and 100 permuted PPR matrices for
#> thegiven network and restart probability beta.
#> 
#> optional arguments:
#>   -h, --help            show this help message and exit
#>   -e EDGELIST_FILE, --edgelist_file EDGELIST_FILE
#>                         Path to TSV file listing edges of the interaction
#>                         network, whereeach row contains the indices of two
#>                         genes that are connected in thenetwork.
#>   -i GENE_INDEX_FILE, --gene_index_file GENE_INDEX_FILE
#>                         Path to tab-separated file containing an index in the
#>                         first columnand the name of the gene represented at
#>                         that index in the secondcolumn of each line.
#>   -nn NETWORK_NAME, --network_name NETWORK_NAME
#>                         Name of network.
#>   -p PREFIX, --prefix PREFIX
#>                         Output prefix.
#>   -is INDEX_FILE_START_INDEX, --index_file_start_index INDEX_FILE_START_INDEX
#>                         Minimum index in the index file.
#>   -b BETA, --beta BETA  Beta is the restart probability for the insulated heat
#>                         diffusion process.
#>   -op, --only_permutations
#>                         Only permutations, i.e., do not generate influence
#>                         matrix forobserved data. Useful for generating
#>                         permuted network files onmultiple machines.
#>   -q Q, --Q Q           Edge swap constant. The script will attempt Q*|E| edge
#>                         swaps
#>   -ps PERMUTATION_START_INDEX, --permutation_start_index PERMUTATION_START_INDEX
#>                         Index at which to start permutation file names.
#>   -np NUM_PERMUTATIONS, --num_permutations NUM_PERMUTATIONS
#>                         Number of permuted networks to create.
#>   -o OUTPUT_DIR, --output_dir OUTPUT_DIR
#>                         Output directory.
#>   -c CORES, --cores CORES
#>                         Use given number of cores. Pass -1 to use all
#>                         available.

I then use it on the two provided edge lists, "example/example_edgelist.txt" and "example/example_edgelist2.txt". The output of each are saved to different directories. Note that I just used 0.5 for the beta without trying different values. There is method for programmatically deciding on a value for the $\beta$ of the RWR (described in the paper's Supplementary), though it doesn't seem to be too important to get a high-precision value (judging from personal experience and the author's comments in the Supplementary).

python makeNetworkFiles.py \
  --edgelist_file example/example_edgelist.txt \
  --gene_index_file example/example_gene_index.txt \
  --network_name network1 \
  --prefix network1 \
  --beta 0.5 \
  --num_permutations 100 \
  --output_dir example_run/network1 \
  --cores 1

#> Creating PPR matrix for real network
#> --------------------------------------
#> 
#> Creating edge lists for permuted networks
#> -------------------------------------------
#> * Loading edge list..
#>  - 31 edges among 25 nodes.
#>  - No. swaps to attempt = 3565.0
#> * Creating permuted networks...
#> 100/100
#> * Avg. No. Swaps Made: 1842
#> 
#> Creating PPR matrices for permuted networks
#> ---------------------------------------------
#> 100/100%

python makeNetworkFiles.py \
  --edgelist_file example/example_edgelist2.txt \
  --gene_index_file example/example_gene_index.txt \
  --network_name network2 \
  --prefix network2 \
  --beta 0.5 \
  --num_permutations 100 \
  --output_dir example_run/network2 \
  --cores 1

#>Creating PPR matrix for real network
#>--------------------------------------
#>
#>Creating edge lists for permuted networks
#>-------------------------------------------
#>* Loading edge list..
#>  - 30 edges among 25 nodes.
#>  - No. swaps to attempt = 3450.0
#>* Creating permuted networks...
#>100/100
#>* Avg. No. Swaps Made: 1820
#>
#>Creating PPR matrices for permuted networks
#>---------------------------------------------
#>100/100%

Here is what the file structure for the first network looks like.

ll example_run/network1

#> total 32
#> -rw-r--r--    1 admin  staff    12K Oct 12 18:02 network1_ppr_0.5.h5
#> drwxr-xr-x  202 admin  staff   6.3K Oct 12 18:02 permuted

Run HotNet2

Running HotNet2 is done using the HotNet2.py script. Below is the help information.

python HotNet2.py --help

#> usage: HotNet2.py [-h] -nf [NETWORK_FILES [NETWORK_FILES ...]] -pnp
#>                   [PERMUTED_NETWORK_PATHS [PERMUTED_NETWORK_PATHS ...]] -hf
#>                   [HEAT_FILES [HEAT_FILES ...]] [-ccs MIN_CC_SIZE]
#>                   [-d [DELTAS [DELTAS ...]]] [-np NETWORK_PERMUTATIONS]
#>                   [-cp CONSENSUS_PERMUTATIONS] [-hp HEAT_PERMUTATIONS] -o
#>                   OUTPUT_DIRECTORY [-c NUM_CORES] [-dsf DISPLAY_SCORE_FILE]
#>                   [-dnf DISPLAY_NAME_FILE] [--output_hierarchy]
#>                   [--verbose {0,1,2,3,4}]
#> 
#> Helper script for simple runs of generalized HotNet2, including
#> automatedparameter selection.
#> 
#> optional arguments:
#>   -h, --help            show this help message and exit
#>   -nf [NETWORK_FILES [NETWORK_FILES ...]], --network_files [NETWORK_FILES [NETWORK_FILES ...]]
#>                         Path to HDF5 (.h5) file containing influence matrix
#>                         and edge list.
#>   -pnp [PERMUTED_NETWORK_PATHS [PERMUTED_NETWORK_PATHS ...]], --permuted_network_paths [PERMUTED_NETWORK_PATHS #> [PERMUTED_NETWORK_PATHS ...]]
#>                         Path to influence matrices for permuted networks, one
#>                         path per network file. Include ##NUM## in the path to
#>                         be replaced with the iteration number
#>   -hf [HEAT_FILES [HEAT_FILES ...]], --heat_files [HEAT_FILES [HEAT_FILES ...]]
#>                         Path to heat file containing gene names and scores.
#>                         This can eitherbe a JSON file created by
#>                         generateHeat.py, in which case the filename must end
#>                         in .json, or a tab-separated file containing a
#>                         genename in the first column and the heat score for
#>                         that gene in thesecond column of each line.
#>   -ccs MIN_CC_SIZE, --min_cc_size MIN_CC_SIZE
#>                         Minimum size connected components that should be
#>                         returned.
#>   -d [DELTAS [DELTAS ...]], --deltas [DELTAS [DELTAS ...]]
#>                         Delta value(s).
#>   -np NETWORK_PERMUTATIONS, --network_permutations NETWORK_PERMUTATIONS
#>                         Number of permutations to be used for delta parameter
#>                         selection.
#>   -cp CONSENSUS_PERMUTATIONS, --consensus_permutations CONSENSUS_PERMUTATIONS
#>                         Number of permutations to be used for consensus
#>                         statistical significance testing.
#>   -hp HEAT_PERMUTATIONS, --heat_permutations HEAT_PERMUTATIONS
#>                         Number of permutations to be used for statistical
#>                         significance testing.
#>   -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
#>                         Output directory. Files results.json, components.txt,
#>                         andsignificance.txt will be generated in
#>                         subdirectories for each delta.
#>   -c NUM_CORES, --num_cores NUM_CORES
#>                         Number of cores to use for running permutation tests
#>                         in parallel. If-1, all available cores will be used.
#>   -dsf DISPLAY_SCORE_FILE, --display_score_file DISPLAY_SCORE_FILE
#>                         Path to a tab-separated file containing a gene name in
#>                         the firstcolumn and the display score for that gene in
#>                         the second column ofeach line.
#>   -dnf DISPLAY_NAME_FILE, --display_name_file DISPLAY_NAME_FILE
#>                         Path to a tab-separated file containing a gene name in
#>                         the firstcolumn and the display name for that gene in
#>                         the second column ofeach line.
#>   --output_hierarchy    Output the hierarchical decomposition of the HotNet2
#>                         similarity matrix.
#>   --verbose {0,1,2,3,4}
#>                         Set verbosity of output (minimum: 0, maximum: 5).

The first run of HotNet2 that I show in this vignette just runs one heat file on one network. This is the simplest use-case for HotNet2. It is likely to be satisfactory for most users.

python HotNet2.py \
  --network_files example_run/network1/network1_ppr_0.5.h5 \
  --permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
  --heat_files example_run/heatfiles/mutations_heatfile.json \
  --output_directory example_run/output_simple \
  --num_cores 1
#> /Users/admin/Documents/Python/HotNet2_example/hotnet2/hotnet2/hnio.py:402: H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
#>   dictionary = {key:f[key].value for key in f}
#> * Running HotNet2 in consensus mode...
#>  - network1 mutationsheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...

The second example run, shown below, runs two heat files on two networks. HotNet2 then creates a "consensus" output for the two networks.

python HotNet2.py \
  --network_files example_run/network1/network1_ppr_0.5.h5 \
                  example_run/network2/network2_ppr_0.5.h5 \
  --permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
                           example_run/network2/permuted/network2_ppr_0.5_##NUM##.h5 \
  --heat_files example_run/heatfiles/mutations_heatfile.json \
               example_run/heatfiles/scores_heatfile.json \
  --output_directory example_run/output_consensus \
  --num_cores 1

#> /Users/admin/Documents/Python/HotNet2_example/hotnet2/hotnet2/hnio.py:402: H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
#>   dictionary = {key:f[key].value for key in f}
#> * Running HotNet2 in consensus mode...
#>  - network1 mutationsheat
#>  - network1 scoresheat
#>  - network2 mutationsheat
#>  - network2 scoresheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...

Output File Structure

Here is a diagram showing the file structure for the simple output.

#> tree example_run/output_simple
#> example_run/output_simple
#> ├── consensus
#> │   ├── stats.tsv
#> │   ├── subnetworks.json
#> │   ├── subnetworks.tsv
#> │   └── viz-data.json
#> └── network1-mutationsheat
#>     ├── delta_0.007907676044851542
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     ├── delta_1.024861412588507e-05
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     └── viz-data.json
#> 
#> 4 directories, 11 files

The output for the run with multiple networks is similar in structure but there are more files.

tree example_run/output_consensus
#> example_run/output_consensus
#> ├── consensus
#> │   ├── stats.tsv
#> │   ├── subnetworks.json
#> │   ├── subnetworks.tsv
#> │   └── viz-data.json
#> ├── network1-mutationsheat
#> │   ├── delta_0.007907676044851542
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   ├── delta_1.024861412588507e-05
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   └── viz-data.json
#> ├── network1-scoresheat
#> │   ├── delta_0.03640584065578878
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   ├── delta_0.04944752808660269
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   ├── delta_0.052240293473005295
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   ├── delta_0.10117589682340622
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   └── viz-data.json
#> ├── network2-mutationsheat
#> │   ├── delta_0.005027415044605733
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   ├── delta_5.711058292945381e-06
#> │   │   ├── components.txt
#> │   │   ├── results.json
#> │   │   └── significance.txt
#> │   └── viz-data.json
#> └── network2-scoresheat
#>     ├── delta_0.051236389204859734
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     ├── delta_0.05293082073330879
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     ├── delta_0.05503008887171745
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     ├── delta_0.10069134272634983
#>     │   ├── components.txt
#>     │   ├── results.json
#>     │   └── significance.txt
#>     └── viz-data.json
#> 
#> 17 directories, 44 files