ASE:

Files needed for the analysis

Reference genome

For example, GRCm38_68.fa.

wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

gtf annotation for the reference genome

For example, Mus_musculus.GRCm38.68.gtf.gz for corresponding reference genome version.

wget ftp://ftp.ensembl.org/pub/release-68/gtf/mus_musculus/Mus_musculus.GRCm38.68.gtf.gz

Either:
One/Two vcf files for maternal and paternal imbred lines (should be compatible with the reference genome)
Joint vcf file for multiple species, where two lines are presented (should be compatible with reference genome)
Individual vcf file or joint individuals vcf file (should be compatible with reference genome) (see correponding section about reference preparation)

For example, mgp.v5.merged.snps_all.dbSNP142.vcf.gz for multiple mice lines.

wget ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz

Note: you can do vcf calling by using mutect, varscan, etc. if you heve no pre-existing vcf

Input preprocessing before anything:

Reference genome fasta should be indexed (samtools faidx).
vcf-files should be compressed (bgzip) and indexed (samtools tabix).

Reference preparation:

One script to rule them all:

python3 /full/path/to/ASE/python/prepare_reference_tmp.py --PSEUDOREF True --HETVCF True \
  --pseudoref_dir /full/path/to/dir/for/pseudo/ref/out/ \
  --vcf_dir /full/path/todir/for/vcf/outputs/ \
  --ref /full/path/to/GRCm38_68.fa \
  --name_mat 129S1_SvImJ --name_pat CAST_EiJ \
  --vcf_joint mgp.v5.merged.snps_all.dbSNP142.vcf.gz \
  --gtf /n/scratch2/sv111/ASE/Mus_musculus.GRCm38.68.gtf

For help:

python3 /home/am717/ASE/python/prepare_reference_tmp.py --help

--PSEUDOREF True

Input:

| | Inbred lines | Inbred lines | Individual | Individual | | --- | --- | --- | --- | --- | | | Joint lines vcf | Separate line(s) vcf | Joint individuals vcf | Separate individual vcf | | FASTA Reference genome | --ref | --ref | --ref | --ref | | VCF Variant file(s)[1] | --vcf_joint | --vcf_mat, --vcf_pat | --vcf_joind | --vcf_ind | | Name(s) | --name_mat, --name_pat | --name_mat, --name_pat | --name_ind | --name_ind | | FASTA Output directory | --pseudo_dir | --pseudo_dir | --pseudo_dir | --pseudo_dir | | VCF Output directory | --vcf_dir | --vcf_dir | --vcf_dir | --vcf_dir |

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently.

Output:
Pseudoreference genome fastas with own directories.
Support vcf or bed files (if needed).

--HETVCF True

Input:

| | Inbred lines | Inbred lines | Individual | Individual | | --- | --- | --- | --- | --- | | | Joint lines vcf | Separate line(s) vcf | Joint individuals vcf | Separate individual vcf | | FASTA Reference genome | --ref | --ref | --ref | --ref | | nothing or GTF or BED Selected regions annotation [2] | --gtf or --bed | --gtf or --bed | --gtf or --bed | --gtf or --bed | | VCF Variant file(s)[1] | --vcf_joint | --vcf_mat, --vcf_pat | --vcf_joind | --vcf_ind | | Name(s) | --name_mat, --name_pat | --name_mat, --name_pat | --name_ind | --name_ind | | VCF Output directory | --vcf_dir | --vcf_dir | --vcf_dir | --vcf_dir |

[1] If one or the alleles in case of inbred lines is reference, then everything should be provided as mat or pat only, consistently. [2] Bed will be considered as main; if gtf provided, automatically considered regions=exons and groups=genes; bed file should have 4 columns and prepared in advance: contig, start position, end position, group ID (no colnames).

Output:
VCF with heterozygous positions, one allele as reference and the second as alternative.
Support vcf files (if needed).

RNA-seq preparation:

Make sure that your fasta files are ready for the STAR alignment step, each of them should be indexed with STAR (STAR --runMode genomeGenerate).

pseudoRefDirs=/full/path/to/dir/for/pseudo/ref/out/
STAR --runMode genomeGenerate --genomeDir $pseudoRefDirs/129S1_SvImJ/ --genomeFastaFiles $pseudoDir/129S1_SvImJ/129S1_SvImJ_pseudo.fa
STAR --runMode genomeGenerate --genomeDir $pseudoRefDirs/CAST_EiJ/ --genomeFastaFiles $pseudoDir/CAST_EiJ/CAST_EiJ_pseudo.fa

Make sure that you have unziped (gunzip) sample fasta/fastq files.

Make sure that you have: * bed file (with four columns: contig, start and end positions, group ID; no column names) with selected regions, for example, exon regions grouped by genes:

awk '$3=="exon" && ($1 ~ /^[1-9]/ || $1 == "X" || $1 == "Y")' /full/path/to/Mus_musculus.GRCm38.68.gtf | cut -f1,4,5,9 | awk -F $'\t' 'BEGIN {OFS = FS} {split($4, a, "gene_id \""); split(a[2], b, "\""); print $1, $2-1, $3, b[1]}' > /full/path/to/output/Mus_musculus.GRCm38.68.EXONS.bed

snp table, which can be obtained from vcf with heterozigous positions created above, as 5 first columns, for example:

grep "^#CHROM" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | cut -f1-5 > /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt
grep -v "^#" /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.exons.vcf | sort -V | cut -f1-5 >> /full/path/to/Het_Allelic_129S1_SvImJ_CAST_EiJ.snp_table.txt

gimelbrantlab/ASE documentation built on March 1, 2020, 5:41 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com