    message("Loading knitr library")
    if (!require("knitr", quietly = TRUE)) {
      message("Installing knitr library")
      install.packages("knitr", dependencies = TRUE)
    }
    require(knitr)

    ## Default parameters for displaying the slides
    knitr::opts_chunk$set(
      echo = TRUE,
      eval = FALSE,
      fig.width = 7,
      fig.height = 5,
      fig.align = "center",
      encoding = "UTF-8",
      fig.path = "figures/",
      size = "tiny",
      warning = FALSE,
      results = TRUE,
      message = FALSE,
      comment = "")
```{bash eval=FALSE}
ssh [login]@core.cluster.france-bioinformatique.fr
```
## Shared space for a project

Our shared space is in this folder: `/shared/projects/rnaseqmva/`. All the declared participants in the project (you and me) have read and write access to the whole folder.

BEWARE: if you remove things, they are lost! We need to push all the code to github regularly.

## Management of the git clone for the RNAseqMVA project

### I imported a clone of the RNAseqMVA package there

This needs to be done only once, please don't redo it!

```{bash eval=FALSE}
cd /shared/projects/rnaseqmva/
git clone https://github.com/elqumsan/RNAseqMVA.git
```
Each time we start working, we first run `git pull` to make sure we have the latest version.
```{bash eval=FALSE}
cd /shared/projects/rnaseqmva/RNAseqMVA ## go to the shared project folder
git pull ## update the local copy of RNAseqMVA
git status ## check the status
```
*********

## Running the workflow on the RStudio server at IFB-core

### Connecting to the RStudio server

In a Web browser, open a connection to **<https://rstudio.cluster.france-bioinformatique.fr/>**

### Opening a session of the RNAseqMVA project

In the R console, type the following command.

```r
setwd("/shared/projects/rnaseqmva/RNAseqMVA/")
```
In the Files tab of the bottom-right pane, click on the function More > Go to Working Directory.
Double-click on the file `RNAseqMVA.Rproj`.
After having done this, you should in principle see a Git tab in the top-right pane (the location of this pane depends on your RStudio configuration).
Open the Build tab in the pane that contains the Environment, History, Connections, Build and Git tabs.
Click Install and Restart.
Note: the first time you run this, you might be prompted to install libraries.
One option -- prudent but somewhat cumbersome -- is to install manually the libraries that are absolutely required to compile the package.
An alternative -- a bit tricky but convenient -- is to run the script `misc/01a_load_libraries.R` of the package, which automatically installs all the CRAN and Bioconductor libraries used in the package.
    setwd("/shared/projects/rnaseqmva/RNAseqMVA/")
    source("misc/01a_load_libraries.R")
We had to install manually two packages to handle some dependencies:
We use a YAML file to configure the workflow.
In RStudio, open the file `~/RNAseqMVA/misc/00_project_parameters.yml` and check the parameters.
An alternative to RStudio is to run the workflow directly on the cluster nodes, via the `ssh` connection and with an interactive slurm session (`sinteractive`).
BEWARE: R should never run on the cluster mother machine. It should always run on a node. This can be achieved in different ways.

- Opening an interactive session on a node with `sinteractive`.
- Sending an `Rscript` command to slurm via `srun`.
- Storing several `Rscript` calls in a bash file, which can then be run via `sbatch`.
Difficulties:
Advantages:
```{bash eval=FALSE}
ssh [login]@core.cluster.france-bioinformatique.fr

module load conda

cd /shared/projects/rnaseqmva/RNAseqMVA
```
### Installing or updating the rnaseqmva conda environment

If this is the first time you use RNAseqMVA, you will need to install the conda environment, which will automatically install all the software required to run your analyses. This is done with a single command, but it can take time since it will install a specific version of R plus all the required libraries and dependencies.

```{bash eval=FALSE}
## The first time only
conda env create -f conda-rnaseqmva.yml
```
If the conda environment has been changed since your last session, you can update it to make sure you have all the new requirements.
```{bash eval=FALSE}
conda env update -f conda-rnaseqmva.yml
```
### Loading the conda environment

At each session, you need to activate the `rnaseqmva` conda environment.

```{bash eval=FALSE}
module load conda
conda activate rnaseqmva
```
It may be useful to update the RNAseqMVA package from the github repository.
```{bash eval=FALSE}
cd /shared/projects/rnaseqmva/RNAseqMVA
git pull
```
### Compiling the RNAseqMVA package

```{bash eval=FALSE}
## If not done before, load the required conda environment
module load conda
conda activate rnaseqmva

## Go to the package-enclosing directory
cd /shared/projects/rnaseqmva

## (Re)build the package
R CMD build RNAseqMVA

## Check the package and doc (note: this takes some time)
R CMD check RNAseqMVA

## Install the rebuilt package
R CMD INSTALL --no-multiarch --with-keep.source RNAseqMVA
```
Note: `sinteractive` blocks resources on a node as long as the user does not close it. This mode should thus be used only for particular cases which require live interactions with R (e.g. testing a script with a given study case before sending it to the job scheduler for all the study cases, or debugging).
If you want to send the execution of a script to the job scheduler, skip this section and look for the `srun` and `sbatch` sections below.
```{bash eval=FALSE}
sinteractive --mem=48GB
cd /shared/projects/rnaseqmva/RNAseqMVA
```
The `sinteractive` session automatically opens a `screen` environment, which enables you to create several tabs (e.g. one for the editor, a second one for the unix shell and a third one for the R session) and swap between them for your work. This requires familiarity with `screen`.

First, edit the file `misc/00_project_parameters.yml` with a text editor, in order to select the study case and all the parameters.

In one of the `screen` tabs, open an R session

```{bash eval=FALSE}
R
```
and run the following command.

    source("misc/main_processes.R")
srun Rscript
I created a script named `srun_analysis.sh`, with the following content.
```{bash eval=FALSE}
#!/bin/bash

cd /shared/projects/rnaseqmva/RNAseqMVA

WORKSPACE=/shared/projects/rnaseqmva/RNAseqMVA_workspace

LOG_DIR=${WORKSPACE}/logs

mkdir -p ${LOG_DIR}

## Time stamp used to name the log files
START_DATE=$(date +%Y-%m-%d_%H%M%S)

RECOUNT_ID=SRP042620
FEATURE=transcript
PREFIX=${RECOUNT_ID}_${FEATURE}_${START_DATE}
CPUS=50
MEM=32GB

echo "RECOUNT_ID: ${RECOUNT_ID}"
echo "FEATURE: ${FEATURE}"
echo "PREFIX: ${PREFIX}"
echo "CPUS: ${CPUS}"
echo "MEM: ${MEM}"

srun --mem=${MEM} --cpus-per-task=${CPUS} --partition=fast \
  --output ${LOG_DIR}/${PREFIX}_out.txt \
  --error ${LOG_DIR}/${PREFIX}_err.txt \
  Rscript --vanilla misc/main_processes.R ${RECOUNT_ID} ${FEATURE}
```
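The way the log file names are derived can be tried out without touching the cluster. The sketch below is only an illustration: it uses a temporary directory as a stand-in for the shared workspace, and it assumes underscore separators between the study ID, the feature type and the time stamp.

```shell
# Stand-in for the shared workspace (assumption: a throw-away temp directory)
WORKSPACE=$(mktemp -d)
LOG_DIR="${WORKSPACE}/logs"
mkdir -p "${LOG_DIR}"

# Command substitution captures the output of `date` as a time stamp
START_DATE=$(date +%Y-%m-%d_%H%M%S)

RECOUNT_ID=SRP042620
FEATURE=transcript

# Assumed naming scheme: <study>_<feature>_<time stamp>
PREFIX="${RECOUNT_ID}_${FEATURE}_${START_DATE}"
OUT_LOG="${LOG_DIR}/${PREFIX}_out.txt"
ERR_LOG="${LOG_DIR}/${PREFIX}_err.txt"

echo "stdout log: ${OUT_LOG}"
echo "stderr log: ${ERR_LOG}"
```

Because the time stamp goes down to the second, two runs of the same study case started at different times write to different log files.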
This script can be used to send a job to slurm. However, running it directly is not ideal because `srun` stays active during the execution of the process, so the job is killed if we quit the session.

To circumvent this, we run the script via `sbatch`.

```{bash eval=FALSE}
## Change directory to RNAseqMVA package
cd /shared/projects/rnaseqmva/RNAseqMVA

## Initiate the environment
## This is necessary for the sbatch command !
module load conda
# conda init bash
conda activate rnaseqmva

## Send the script to the job scheduler
sbatch --mem=32GB --cpus-per-task=50 --partition=long srun_analysis.sh

## Check your slurm jobs
squeue -u ${UID}
```
Julien Seiler wrote a script (adapted by JvH) to send all the jobs to slurm via a jobarray, with the appropriate tuning of parameters (memory, nodes).
```{bash eval=FALSE}
#!/bin/bash

module load conda
conda init bash
conda activate rnaseqmva

mkdir -p slurm_logs

RECOUNT_IDS=(SRP035988 SRP042620 SRP056295 SRP057196 SRP061240 SRP062966 SRP066834)
FEATURE_TYPES=(gene transcript)

## Derive the study case and feature type from the index of the array task
recount_index=$((SLURM_ARRAY_TASK_ID / 2))
RECOUNT_ID=${RECOUNT_IDS[recount_index]}
feature_index=$((SLURM_ARRAY_TASK_ID % 2))
FEATURE_TYPE=${FEATURE_TYPES[feature_index]}

cd /shared/projects/rnaseqmva/RNAseqMVA

WORKSPACE=/shared/projects/rnaseqmva/RNAseqMVA_workspace

LOG_DIR=${WORKSPACE}/logs

mkdir -p ${LOG_DIR}

START_DATE=$(date +%Y-%m-%d_%H%M%S)

PREFIX=${RECOUNT_ID}_${FEATURE_TYPE}_${START_DATE}

echo "${START_DATE} ${SLURM_ARRAY_TASK_ID} ${SLURM_ARRAY_JOB_ID} ${RECOUNT_ID} ${FEATURE_TYPE} ${LOG_DIR}/${PREFIX}" >> srun_jobs_sent.tsv

srun --mem=32GB --cpus-per-task=50 \
  --output ${LOG_DIR}/${PREFIX}_out.txt \
  --error ${LOG_DIR}/${PREFIX}_err.txt \
  Rscript --vanilla misc/main_processes.R ${RECOUNT_ID} ${FEATURE_TYPE}
```
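To check which study case and feature type each array task would receive, the index arithmetic can be run on its own, outside slurm. The sketch below assumes 0-based task indices; `task_to_case` is a hypothetical helper introduced only for this illustration and is not part of RNAseqMVA.

```shell
#!/usr/bin/env bash
# Study cases and feature types, as listed in the job array script
RECOUNT_IDS=(SRP035988 SRP042620 SRP056295 SRP057196 SRP061240 SRP062966 SRP066834)
FEATURE_TYPES=(gene transcript)

# Hypothetical helper: maps a 0-based array task index to
# "<study case> <feature type>"
task_to_case() {
  local task_id=$1
  local n_features=${#FEATURE_TYPES[@]}
  local recount_index=$((task_id / n_features))
  local feature_index=$((task_id % n_features))
  echo "${RECOUNT_IDS[recount_index]} ${FEATURE_TYPES[feature_index]}"
}

# Number of tasks needed to cover every combination (7 studies x 2 features)
N_TASKS=$((${#RECOUNT_IDS[@]} * ${#FEATURE_TYPES[@]}))

# Print the full mapping; no job is submitted here
for ((task_id = 0; task_id < N_TASKS; task_id++)); do
  printf '%s\t%s\n' "${task_id}" "$(task_to_case "${task_id}")"
done
```

Under this mapping the array needs the indices 0 to 13, which could be requested for instance with `sbatch --array=0-13`. Whether the original script already declares this range in its header is not shown here, so treat the range as an assumption.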
This can then be sent to the slurm job scheduler with the `sbatch` command, but we first need to define the appropriate environment.

```{bash eval=FALSE}
## Change directory to RNAseqMVA package
cd /shared/projects/rnaseqmva/RNAseqMVA

## Initiate the environment
## This is necessary for the sbatch command !
module load conda
# conda init bash
conda activate rnaseqmva

## Send the script to the job scheduler
sbatch srun_jobarray.sh
```
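```{bash eval=FALSE}
## List your jobs in the slurm queue
squeue -u $UID

## Collect the IDs of your jobs (skipping the header line)
MY_JOBS=$(squeue -u $UID | grep -v JOBID | awk '{print $1}')

## Display status information about these jobs
sstat $MY_JOBS
```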
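The extraction of job IDs can be tested offline on a captured listing. The sample text below is made up for illustration (job numbers, user name and node names are not real), and joining the IDs with commas is an extra step that is handy for commands expecting a comma-separated job list.

```shell
# Made-up sample of a squeue listing (assumption: default column layout)
SAMPLE_SQUEUE_OUTPUT='  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 123456      fast srun_ana    alice  R       1:02      1 node-01
 123457      long srun_ana    alice PD       0:00      1 (Priority)'

# Skip the header line (NR > 1) and keep the first column (the job ID)
MY_JOBS=$(printf '%s\n' "${SAMPLE_SQUEUE_OUTPUT}" | awk 'NR > 1 {print $1}')

# Join the IDs, one per line, into a single comma-separated list
JOB_LIST=$(printf '%s\n' "${MY_JOBS}" | paste -s -d , -)

echo "${JOB_LIST}"
```

On the cluster itself, `squeue` can also omit the header and print only the job IDs (options `-h` and `-o "%i"`), which makes the `grep` and `awk` steps unnecessary.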
#### Useful slurm commands

| Command | Description |
|------------|--------------------------------|
| `sinfo` | view information about Slurm nodes and partitions |
| `srun` | |
| `sbatch` | |
| `sinteractive` | |
| `squeue` | view information about jobs located in the Slurm scheduling queue |
| `sstat` | Display various status information of a running job/step |
| `sacct` | Display accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
| `scancel` | cancel the execution of a given job |

### Following the execution logs

```{bash eval=FALSE}
cd /shared/projects/rnaseqmva/RNAseqMVA
LOG_DIR=/shared/projects/rnaseqmva/logs
ls -ltr ${LOG_DIR}
```
By default, RNAseqMVA creates a working directory in your home folder (`~/RNAseqMVA_workspace`). However, we want to put the working directory in our shared space.
We could change the parameter in the YAML configuration file, but this would impose the same directory for all the other places where the package is running (which would not be convenient for our laptops and for the distribution to other users).
To circumvent this, we will create a directory in the shared space, and create a soft link from our home directory to this shared space.
```{bash eval=FALSE}
ssh [login]@core.cluster.france-bioinformatique.fr

ln -fs /shared/projects/rnaseqmva ~/rnaseqmva_shared_space

ln -fs /shared/projects/rnaseqmva/RNAseqMVA ~/RNAseqMVA

mkdir -p /shared/projects/rnaseqmva/RNAseqMVA_workspace

ln -fs /shared/projects/rnaseqmva/RNAseqMVA_workspace ~/RNAseqMVA_workspace
```
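The effect of such a soft link can be checked safely in a sandbox before doing it on the cluster. In the sketch below, the temporary directory and the `FAKE_HOME` folder are stand-ins invented for the illustration; only the link logic mirrors the commands above.

```shell
# Sandbox standing in for the cluster file system (assumption: temp directory)
SANDBOX=$(mktemp -d)
SHARED="${SANDBOX}/shared/projects/rnaseqmva/RNAseqMVA_workspace"
FAKE_HOME="${SANDBOX}/home"
mkdir -p "${SHARED}" "${FAKE_HOME}"

# Create the soft link from the (fake) home directory to the shared workspace
ln -fs "${SHARED}" "${FAKE_HOME}/RNAseqMVA_workspace"

# A file written through the link ends up in the shared directory
touch "${FAKE_HOME}/RNAseqMVA_workspace/example_result.txt"

ls -l "${FAKE_HOME}"
ls "${SHARED}"
```

One detail worth knowing: if the link already exists and points to a directory, running `ln -fs` again creates a new link inside that directory instead of replacing the old one. Adding the `-n` option (`ln -sfn`) avoids this.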
From now on we can go to the working directory with this command.

```{bash eval=FALSE}
cd ~/RNAseqMVA_workspace
```
This is equivalent to
```{bash eval=FALSE}
cd /shared/projects/rnaseqmva/RNAseqMVA_workspace
```
You can list the files present in this directory.

```{bash eval=FALSE}
ls -l
```
You can also see the organisation of all the files with the very convenient Unix command `tree`.
```{bash eval=FALSE}
tree
```
## Running the workflow

In the **Files** tab, open the file `misc/main_processes.R` and run it step by step.

## Checking the server load with htop

The Unix command `htop` provides a very convenient way to check the server load.

- Beside the R **Console** tab, click on the **Terminal** tab.
- Click on the rectangle to maximise this pane.
- At the terminal prompt, type the command `htop`

## Mounting the shared space on your computer

In order to access the results, one possibility is to mount the shared disk space on your computer via the `sshfs` protocol. On Mac OS X we use Fuse to run this protocol. For other OS we still need to investigate.

```{bash mounting_remote_workspace, eval=FALSE}
## Create a mount point on your local device
export MOUNT_POINT=~/mnt/RNAseqMVA
mkdir -p ${MOUNT_POINT}

## We then mount the remote disk (the shared space on IFB core cluster)
## on our local mount point
export SHARED_DIR=/shared/projects/rnaseqmva/

## Keep only the line with your own login (the last export wins)
export IFB_LOGIN=jvanhelden ## this is for Jacques
export IFB_LOGIN=mabuelqumsan ## this is for Mustafa

sshfs -o allow_other,defer_permissions \
  ${IFB_LOGIN}@core.cluster.france-bioinformatique.fr:${SHARED_DIR} \
  ${MOUNT_POINT}
```
Be patient: the mounting may take a few seconds (but not minutes). After that, you can run the following command to check the content of the remote folder that has been mounted on your local computer.
```{bash checking_mounted_disk, eval=FALSE}
ls -ltr ${MOUNT_POINT}/
```
You can also check the free space remaining on the remote disk.

```{bash disk_free_mounted_folder, eval=FALSE}
## Check the free space on the remote hard drive that contains
## the shared folder (on the IFB core cluster)
df -h ${MOUNT_POINT}
```
For the sake of comparison, check the free disk space on your own computer.
```{bash disk_free_local_home, eval=FALSE}
df -h ${HOME}
```
### Handling the files on the locally mounted remote shared space

Once the shared space has been mounted on your local mount point, you can use different commands to list the files, handle them, or make a local copy.

We can for instance use the `find` command to list all the pdf files found in the result folder (we named this folder `RNAseqMVA_workspace`).

```{bash eval=FALSE}
## Find all files with extension .pdf in the remote folder
find ${MOUNT_POINT}/RNAseqMVA_workspace/results -type f -name '*.pdf'
```
We can also refine the search, by selecting one study case (e.g. SRP042620) and feature type (e.g. gene) and searching the pdf files in the corresponding sub-folder of the workspace.
```{bash eval=FALSE}
export STUDY_CASE=SRP042620
export FEATURE_TYPE=gene
find ${MOUNT_POINT}/RNAseqMVA_workspace/results/${STUDY_CASE}_${FEATURE_TYPE} -type f -name '*.pdf'
```
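The difference between the global search and the search restricted to one study case can be checked on a small mock tree, without mounting anything. The folder layout and file names below are invented for the illustration; only the `find` calls mirror the commands above.

```shell
# Mock results tree in a temp directory (assumption: invented file names)
DEMO_ROOT=$(mktemp -d)
RESULTS="${DEMO_ROOT}/RNAseqMVA_workspace/results"
mkdir -p "${RESULTS}/SRP042620_gene/figures" \
         "${RESULTS}/SRP042620_transcript/figures"
touch "${RESULTS}/SRP042620_gene/figures/plot_a.pdf" \
      "${RESULTS}/SRP042620_gene/table.tsv" \
      "${RESULTS}/SRP042620_transcript/figures/plot_b.pdf"

STUDY_CASE=SRP042620
FEATURE_TYPE=gene

# Count the pdf files in the whole results folder
ALL_PDFS=$(find "${RESULTS}" -type f -name '*.pdf' | wc -l | tr -d ' ')

# Count the pdf files for the selected study case and feature type only
CASE_PDFS=$(find "${RESULTS}/${STUDY_CASE}_${FEATURE_TYPE}" \
  -type f -name '*.pdf' | wc -l | tr -d ' ')

echo "pdf files in all results: ${ALL_PDFS}"
echo "pdf files for ${STUDY_CASE}_${FEATURE_TYPE}: ${CASE_PDFS}"
```

The `.tsv` file is ignored in both searches because of the `-name '*.pdf'` filter, and the restricted search only sees the sub-folder of the selected study case.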
For convenience, we can also create a local copy of this sub-folder, but only synchronise the pdf files. The interest is to make these files easy to access, without too much cost in disk space.

```{bash sync_pdf_one_study_case, eval=FALSE}
## Synchronize all the figure files of the selected study case in a local folder,
## with the same organisation of sub-folders as in the source directory
export LOCAL_FOLDER=~/RNAseqMVA_selected_figures
mkdir -p ${LOCAL_FOLDER}
find ${MOUNT_POINT}/RNAseqMVA_workspace/results/${STUDY_CASE}_${FEATURE_TYPE} \
  -type f -name '*.pdf' -exec rsync -ruptvl -R {} $LOCAL_FOLDER \;
echo "Local folder: ${LOCAL_FOLDER}"

## Check the disk use of the local copy (which only contains the pdf files)
du -sh ${LOCAL_FOLDER}
```
In principle you should now be able to open a given pdf on your local copy.
On Mac OS X, we can use the very convenient command `open`, which takes as argument one or several file paths, and opens them with the appropriate software.
```{bash eval=FALSE}
find $LOCAL_FOLDER -name '*.pdf' | grep ${STUDY_CASE}_${FEATURE_TYPE} | xargs open
```
We can now generalize the command and make a local mirror of all the pdf files from all the study cases.

```{bash eval=FALSE}
## Synchronise all the figure files from ALL the study cases on your local folder
find ${MOUNT_POINT}/RNAseqMVA_workspace/results \
  -type f -name '*.pdf' -exec rsync -ruptvl -R {} $LOCAL_FOLDER \;
echo "Local folder: ${LOCAL_FOLDER}"

## Check the disk use of the local copy (which only contains the pdf files)
du -sh ${LOCAL_FOLDER}
```
Very important: after your session, I recommend unmounting the remote disk with the following command.
```{bash eval=FALSE}
umount -f ${MOUNT_POINT}
```