knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Run the nextflow pipeline

This is a bit of a misnomer -- nextflow is simply the language. The pipeline is from nf-co.re, which is a highly curated repository for nextflow pipelines. There is only one pipeline for any given task, meaning there will not be 10 different rnaseq_pipelines. Rather, all development for rnaseq is happening in one spot.

There are two options -- you can read the parameter docs to figure out how to input your data, or you can click the "launch version #.#" button in the header of page. There is a form you can fill out which will create the parameters.json document, and also tell you what the command will look like.

Here is an example of what submitting to the pipeline looks like

Sample sheet

sample fastq_1                                                               fastq_2 strandedness
269    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_4178_samples/Brent_J... ""      reverse     
270    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_4040_samples/Brent_s... ""      reverse     
278    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_4040_samples/Brent_s... ""      reverse     
306    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_1658_samples/1658_Br... ""      reverse     
307    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_1658_samples/1658_Br... ""      reverse     
308    /lts/mblab/Crypto/rnaseq_data/lts_sequence/run_1658_samples/1658_Br... ""      reverse     

Move the fastq files to scratch

For those computational people, what you need to do is move the files from lts to scratch. Use your favorite method.

If you wish to use my methods, here is how I do it:

In the brentlabRnaSeqTools there is a function called moveNfCoFastqFiles() which creates a two column dataframe of structure column1: source, column2: destination. I write that to the cluster and then use the bash script below to move the files. Here is my full process, using crypto as an example:

prefix = "/lts/mblab/Crypto/rnaseq_data/lts_sequence"

lookup = moveNfCoFastqFiles(sample_sheet, prefix)

# from my local with the cluster mounted
write_tsv(lookup, "/scratch/mblab/chasem/rnaseq_pipeline/lookups/my_lookup.tsv")

I then go over to the cluster and use a script called moveFiles.sh. It contains:

#!/bin/bash

nrow=$(wc -l $1 | grep -oP "^[[:digit:]]+")


for i in $(seq 1 $nrow); do
    read s d < <(sed -n ${i}p $1)

    dest_dir=$(dirname $d) 

    mkdir -p $dest_dir

    echo "copying $1 of $nrow"

    rsync -aHv $s $d

done

So from the command line I do this:

./job_scripts/moveFiles.sh lookups/my_lookup.tsv

params

NOTE! This is not necessarily a "standard" parameter file -- in particular, the feature_group_type is " and we are skipping the biotype_qc. This is necessary for crypto since there is no biotype information in the GTF. This should not be the case, however, certainly will not be true if you are processing human and mouse data.

{
    "input": "\/scratch\/mblab\/chasem\/rnaseq_pipeline\/daniel_set\/daniel_sample.csv",
    "outdir": ".\/daniel_set_results",
    "gtf": "/scratch\/mblab/chasem\/rnaseq_pipeline\/genome_files_curr\/KN99\/liftoff_h99_to_kn99.gtf",
    "fasta": "\/scratch\/mblab/chasem\/rnaseq_pipeline\/genome_files_curr\/KN99\/KN99_genome_fungidb.fasta",
    "gtf_extra_attributes": "ID",
    "featurecounts_group_type": "",
    "rseqc_modules": "bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication",
    "skip_bigwig": true,
    "skip_biotype_qc": true
}

nextflow command

#!/bin/bash

#SBATCH --time=15:00:00  # right now, 15 hours. change depending on time expectation to run
#SBATCH --mem-per-cpu=5G
#SBATCH -J your_run_name
#SBATCH -o ./your_output_name.out

ml miniconda
ml singularity

source activate /path/to/your/conda_env/nextflow

work_folder_name=work_today

config_path=/scratch/mblab/$USER/configs/conf/wustl_htcf.config

param_file=/scratch/mblab/$USER/param.json

# yes -- keep this here
mkdir -p tmp

# note! -r 3.3 is the 'revision' or version of the nf-co/rnaseq pipeline you wish to use.
# You should use the most up to date version, unless you are holding the version constant
# for consistency across a set

nextflow run nf-core/rnaseq -r 3.3 \
                            -work-dir ${PWD}/${work_folder_name} \
                            -c ${config_path} \
                            -params-file ${param_file}

the Resume function

Oh no! your pipeline errored half way through. Do the following:

  1. read the error message carefully and see if you can fix the issue on your own. For example, are all your input paths correct?
  2. save the error message and send it to someone else.
  3. get on the nf-co/rnaseq_pipeline slack channel (you'll find the link on their site). They are very helpful, and part of the reason we switched to this pipeline is to get access to a wider community of bioinformatians

When you think you ahve resolved the issue, just add the -resume flag to your command like so and resubmit:

#!/bin/bash

#SBATCH --time=15:00:00  # right now, 15 hours. change depending on time expectation to run
#SBATCH --mem-per-cpu=5G
#SBATCH -J your_run_name
#SBATCH -o ./your_output_name.out

ml miniconda
ml singularity

source activate /path/to/your/conda_env/nextflow

work_folder_name=work_today

config_path=/scratch/mblab/$USER/configs/conf/wustl_htcf.config

param_file=/scratch/mblab/$USER/param.json

# yes -- keep this here
mkdir -p tmp

# note! -r 3.3 is the 'revision' or version of the nf-co/rnaseq pipeline you wish to use.
# You should use the most up to date version, unless you are holding the version constant
# for consistency across a set

nextflow run nf-core/rnaseq -r 3.3 \
                            -work-dir ${PWD}/${work_folder_name} \
                            -c ${config_path} \
                            -params-file ${param_file} \
                            -resume # see how easy that is!


cmatKhan/brentlabRnaSeqTools documentation built on Nov. 17, 2021, 5:47 a.m.