Intro

Data

Mail from Cédric.

Les Long Reads :

/work/mzahm/Project_Pangafish/RawData/*.fastq.gz

Les reads Hi-C :

/work/mzahm/Project_Pangafish/RawData/HiC/*.fastq.gz

Les reads 10X :

/work/mzahm/Project_Pangafish/RawData/Run_PANGA1G.13800/Analyse_DemultiplexData.53847/*.fastq.gz

Le génome assemblé LR :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/genome.fa

Le bam Longranger :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/ALIGN/outs/possorted_bam.bam

Le fichier de comptage des tags 10X par bin de 10k :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/possorted-tagcmp.tsv.gz

Le fichier “contact map” simulé :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/genome.mnd.txt

Le fichier .hic généré à partir des données Hi-C :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/genome.final.10X.hic

Le fichier .hic généré à partir des données 10X :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/genome.final.Hi-C.hic

Le fichier .assembly du génome assemblé avec LR+Hi-C pour le chargement dans Juicebox :

/work/project/sigenae/Christophe/Panga_longranger/map_to_scaffolds/genome.final.assembly

La procédure utilisée pour générer un fichier .hic à partir de données 10X :

http://genoweb.toulouse.inra.fr/~sigenae/10X_and_Juicebox.html

Mother 1 - Unpolished contigs: /work2/project/seqoccin/assemblies/assemblers/nanopore/bos_taurus/wtdbg2_trio1_mother_37165_run123/mother_raw.cgt.fa - ONT: /work2/project/seqoccin/data/reads/nanopore/bos_taurus/trio1.mother.run?.fastq.gz - Hi-C: /work2/project/seqoccin/assemblies/scaffolders/hic/bos_taurus/juicer_trio1_mother_37165_wtdbg2_ont_run123_hic_run1/Maison_plus/aligned/inter_30.hic - 10X: /work2/project/seqoccin/assemblies/scaffolders/10x/bos_taurus/LR_wtdbg_assembly_mother_37165_run123/Bovin-37165-align/outs/possorted_bam.bam - Assembly from JuiceBox: /work2/project/seqoccin/assemblies/scaffolders/hic/bos_taurus/juicer_trio1_mother_37165_wtdbg2_ont_run123_polished_hic_run1/Maison_plus/juicebox/mother_raw_wtdbg2_58x_polished.final.assembly

Offspring 2 - Polished contigs: /work/project/seqoccin/assemblies/polishing_1/bos_taurus/nfPolishing_trio2_offspring_37160_wtdbgassemblyrun12345_ontrun12345_10Xrun1234/PolishingResult/PILON_SR/assembly.pilon1.fa - ONT: /work/project/seqoccin/data/reads/nanopore/bos_taurus/trio2.offspring.*.fastq.gz - Hi-C: /work/project/seqoccin/data/reads/hic/bos_taurus/trio2.offspring* - 10X: /work/project/seqoccin/data/reads/10x/bos_taurus/Bovin-37160/*

Mother 1

~/Desktop/Apps/minimap2/minimap2 -t 10 -o /Scratch/mazytnicki/Mother_1/mapping.paf /Scratch/mazytnicki/GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa
~/Apps/samtools-1.10/samtools faidx /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa
cut -f 1-2 /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa.fai > /Scratch/mazytnicki/Mother_1/mother_raw.cgt.len

~/Apps/samtools-1.10/samtools view -c /Scratch/mazytnicki/Mother_1/possorted_bam.bam
# 1590040440

../CheckScaffolding/checkBreaks /Scratch/mazytnicki/Mother_1/mapping.paf /Scratch/mazytnicki/Mother_1/mother_raw.cgt.len 1000000 100000 outBreaks.tsv
../CheckScaffolding/checkBreaks /Scratch/mazytnicki/Mother_1/mapping.paf /Scratch/mazytnicki/Mother_1/mother_raw.cgt.len breaks_hic.tsv 100000 10000 predicted.out putative.out > difference_breaks_hic.txt
../CheckScaffolding/checkBreaks /Scratch/mazytnicki/Mother_1/mapping.paf /Scratch/mazytnicki/Mother_1/mother_raw.cgt.len breaks_ont.tsv 100000 10000 predicted.out putative.out > difference_breaks_ont.txt
../CheckScaffolding/checkChromosome /Scratch/mazytnicki/Mother_1/mapping.paf > scaffChrs.txt
../CheckScaffolding/checkStitch scaffChrs.txt stitch.txt

./checkJoins /Scratch/mazytnicki/Mother_1/mapping.paf joins.txt 10000 10000

zcat /Scratch/mazytnicki/Mother_1/trio1.mother.run1.fastq.gz | wc -l
# 12064000

~/Desktop/Apps/minimap2/minimap2 -t 10 -o /Scratch/mazytnicki/Mother_1/trio1.mother.run1.paf /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa /Scratch/mazytnicki/Mother_1/trio1.mother.run1.fastq.gz

wc -l /Scratch/mazytnicki/Mother_1/trio1.mother.run1.paf
# 4862048 /Scratch/mazytnicki/Mother_1/trio1.mother.run1.paf
~/Apps/samtools-1.10/samtools index -@ 5 /Scratch/mazytnicki/Mother_1/possorted_bam.bam

Offspring 2

scp mzytnicki@genologin.toulouse.inrae.fr:/work/project/seqoccin/assemblies/polishing_1/bos_taurus/nfPolishing_trio2_offspring_37160_wtdbgassemblyrun12345_ontrun12345_10Xrun1234/PolishingResult/PILON_SR/assembly.pilon1.fa /Scratch/mazytnicki/Offspring_2/

# From genologin
sbatch /home/mzytnicki/work/Projects/tenxChecker/map_offspring2.sh
sbatch /home/mzytnicki/work/Projects/tenxChecker/lrmkref.sh
sbatch /home/mzytnicki/work/Projects/tenxChecker/lralign.sh
sbatch /home/mzytnicki/work/Projects/tenxChecker/lralignSmall.sh

~/Apps/samtools-1.10/samtools faidx /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.fa
cut -f 1-2 /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.fa.fai > /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.size
~/Apps/bwa-0.7.17/bwa index /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.fa
python3 /home/mazytnicki/Apps/juicer/misc/generate_site_positions.py HindIII /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.fa /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1_HindIII.txt
# Add bwa, samtools, & gawk to the PATH
PATH=~/bin:$PATH
/Scratch/mazytnicki/Offspring_2/Juicer/scripts/juicer.sh -d /Scratch/mazytnicki/Offspring_2/Juicer/ -z /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.fa -p /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1.size -y /Scratch/mazytnicki/Offspring_2/Juicer/references/assembly.pilon1_HindIII.txt -D /Scratch/mazytnicki/Offspring_2/Juicer/ -t 6 -g bt37160 &> /Scratch/mazytnicki/Offspring_2/Juicer/juicer.out

ln -s /Scratch/mazytnicki/Offspring_2/Juicer/aligned/inter.hic /Scratch/mazytnicki/Offspring_2/inter.hic
scp mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/alignment.bam /Scratch/mazytnicki/Offspring_2/

# LongRanger data does not seem sorted...
samtools sort -m 2G -o possorted_bam2.bam -@ 6 possorted_bam.bam
devtools::load_all(".")
resolution  <- 1000
bamFileName <- "/Scratch/mazytnicki/Mother_1/possorted_bam2.bam"
bamData     <- parseBamFile(bamFileName, resolution)
hicFileName <- "/Scratch/mazytnicki/Mother_1/inter_30.hic"
hicData     <- parseHicFile(hicFileName, resolution)
ontFileName <- "/Scratch/mazytnicki/Mother_1/trio1.mother.run1.paf"
ontData     <- parsePafFile(ontFileName, resolution, 500, 10, 10)
contigFileName <- "/Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa"
scaffolder <- msscaf(contigFileName, resolution)
scaffolder <- addExp(scaffolder, ontData, "ONT")
scaffolder <- addExp(scaffolder, hicData, "HiC")
scaffolder <- addExp(scaffolder, bamData, "10X")
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Mother_1/allObject0.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/allObject0.rds")
scaffolder <- estimateDistributions(scaffolder)
scaffolder <- cleanData(scaffolder)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Mother_1/allObject.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/allObject.rds")
scaffolder <- findBreaks(scaffolder, 0.01)
scaffolder <- splitChromosomes(scaffolder)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Mother_1/allObject1.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/allObject1.rds")
scaffolder <- findJoins(scaffolder, 0.05)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Mother_1/allObject2.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/allObject2.rds")
scaffolder <- scaffold(scaffolder)
Biostrings::writeXStringSet(Biostrings::DNAStringSet(scaffolder@sequences), "sequences.fa")
scaffolder@breaks %>% dplyr::select(ref, bin) %>% dplyr::mutate(bin = bin * resolution) %>% readr::write_tsv("breaks.tsv", col_names = FALSE)
scaffolder@joins %>% dplyr::arrange(pvalue) %>% dplyr::select(ref1, ref2) %>% readr::write_tsv("joins.tsv", col_names = FALSE)
../CheckScaffolding/verifyJoins /Scratch/mazytnicki/Mother_1/mapping.paf 100000 10000 joins.tsv
../CheckScaffolding/checkJoins /Scratch/mazytnicki/Mother_1/mapping.paf outJoins.tsv 1000 1000 joins.tsv
../CheckScaffolding/checkBreaks /Scratch/mazytnicki/Mother_1/mapping.paf /Scratch/mazytnicki/Mother_1/mother_raw.cgt.len 100000 10000 predicted.out breaks.tsv putative.out > difference_breaks.txt

# From tapou
~/Desktop/Apps/quast-5.0.2/quast-lg.py -t 15 -r /Scratch/mazytnicki/GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz -t 6 -o /home/mazytnicki/Desktop/Projects/Other/Assembly/Quast/Test1 /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa wtdbg2_trio1_mother_37165_arcs_links.fa sequences.fa sequences_ONT_HiC.fa
# From genologin
scp /Scratch/mazytnicki/GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz /Scratch/mazytnicki/Mother_1/mother_raw.cgt.fa wtdbg2_trio1_mother_37165_arcs_links.fa sequences.fa sequences_ONT_HiC.fa mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker
scp sequences.fa mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker
# From genologin
module load system/Miniconda3
module load bioinfo/Inspector-v1.2
cd /home/mzytnicki/work/Projects/tenxChecker/
sbatch inspector.sh
devtools::load_all(".")
#object <- makeSimObject(1, 1, 100, 20, 5, 0.2, c(1), c(50), c(), c(), c())
#object <- makeSimObject(2, 5, 100, 20, 10, 0.2, c(1, 2), c(25, 75), c(3, 4), c(4, 5), c(1, 5))
object <- makeSimObject(2, 5, 100, 100, 20, 0.5, c(1, 1, 2, 2), c(25, 75, 75, 75), c(3, 4), c(4, 5), c(1, 10))
#object <- makeSimObject(1, 2, 100, 20, 10, 0.2, c(), c(), c(1), c(2), c(5))
#plot.10X(object)
object <- estimateDistributions(object)
object <- cleanData(object)
object <- findBreaks(object, 0.05)
object <- splitChromosomes(object)
object <- findJoins(object, 0.05)
object <- scaffold(object)
devtools::load_all(".")
scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/allObject.rds")
scaffolder <- readRDS(file = "/Scratch/mazytnicki/Mother_1/tmp.rds")
plot.msscafJoin(scaffolder, ref1 = "ctg672", ref2 = "ctg200", after1 = T, after2 = T, outliers = T, metaSize = T)
plot.msscafJoin(scaffolder, ref1 = "ctg853", ref2 = "ctg304", after1 = F, after2 = T, outliers = F)
plot.msscafJoin(scaffolder, ref1 = "ctg345", ref2 = "ctg326", after1 = F, after2 = F, outliers = F)
plot.msscafJoin(scaffolder, ref1 = "ctg369", ref2 = "ctg347", after1 = F, after2 = F, outliers = F, metaSize=T)
plot.msscafJoin(scaffolder, ref1 = "ctg502", ref2 = "ctg310", after1 = F, after2 = F, outliers = F, metaSize=T) # Nothing here
plot.msscafJoin(scaffolder, ref1 = "ctg975", ref2 = "ctg941", after1 = T, after2 = F, outliers = F, metaSize=T) # 100% outlier
plot.msscafJoin(scaffolder, ref1 = "ctg672", ref2 = "ctg200", after1 = T, after2 = T, outliers = F, metaSize=T)
plot.msscafJoin(scaffolder, ref1 = "ctg831", ref2 = "ctg534", after1 = T, after2 = T, outliers = F, metaSize=T)
plot.msscafJoin(scaffolder, ref1 = "ctg831", ref2 = "ctg534", after1 = F, after2 = T, outliers = T, metaSize=T) # Borderline
plot.msscafJoin(scaffolder, ref1 = "ctg295", ref2 = "ctg159", after1 = F, after2 = T, outliers = T, metaSize=T) 
plot.msscafJoin(scaffolder, ref1 = "ctg36", ref2 = "ctg18", after1 = T, after2 = F, outliers = T, metaSize=T) 
plot.msscafJoin(scaffolder, ref1 = "ctg75", ref2 = "ctg33", after1 = T, after2 = T, outliers = T, metaSize=T) 
sizes <- scaffolder@sizes
scaffolder <- checkJoins(scaffolder, 0.05)
scaffolder <- removeAmbiguousJoins(scaffolder)
scaffolder <- checkCorners(scaffolder, 0.05)
scaffolder <- mergeJoins(scaffolder, 0.05)

Oddly, the possorted is not always pos-sorted...

devtools::load_all(".")
resolution  <- 1000
#dirName     <- "/media/Stock/tenxchecker/Offspring_2"
dirName     <- "/Scratch/mazytnicki/Offspring_2"
bamFileName <- file.path(dirName, "possorted_bam2.bam") # 438,352,223 elements
bamData     <- parseBamFile(bamFileName, resolution)
hicFileName <- file.path(dirName, "inter.hic")
hicData     <- parseHicFile(hicFileName, resolution)
ontFileName <- file.path(dirName, "alignment.paf") # 19,005,147 elements
ontData     <- parsePafFile(ontFileName, resolution, 500, 10, 10)
contigFileName <- file.path(dirName, "assembly.pilon1.fa")
scaffolder <- msscaf(contigFileName, resolution)
scaffolder <- addExp(scaffolder, ontData, "ONT")
scaffolder <- addExp(scaffolder, hicData, "HiC")
scaffolder <- addExp(scaffolder, bamData, "10X")
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Offspring_2/allObject0.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Offspring_2/allObject0.rds")
scaffolder <- estimateDistributions(scaffolder)
scaffolder <- cleanData(scaffolder)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Offspring_2/allObject.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Offspring_2/allObject.rds")
scaffolder <- findBreaks(scaffolder, 0.01)
scaffolder <- splitChromosomes(scaffolder)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Offspring_2/allObject1.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Offspring_2/allObject1.rds")
#Biostrings::writeXStringSet(Biostrings::DNAStringSet(scaffolder@sequences), "sequencesSplit.fa")
scaffolder <- findJoins(scaffolder, 0.05)
#saveRDS(scaffolder, file = "/Scratch/mazytnicki/Offspring_2/allObject2.rds")
#scaffolder <- readRDS(file = "/Scratch/mazytnicki/Offspring_2/allObject2.rds")
scaffolder <- scaffold(scaffolder)
Biostrings::writeXStringSet(Biostrings::DNAStringSet(scaffolder@sequences), "sequences.fa")
scaffolder@breaks %>% dplyr::select(ref, bin) %>% dplyr::mutate(bin = bin * resolution) %>% readr::write_tsv("breaks.tsv", col_names = FALSE)
scaffolder@joins %>% dplyr::arrange(pvalue) %>% dplyr::select(ref1, ref2) %>% readr::write_tsv("joins.tsv", col_names = FALSE)

Tigmint

# from genologin
cd /home/mzytnicki/work/Projects/tenxChecker/Offspring
sbatch lrb.sh
fold refdata-assembly.pilon1/fasta/genome.fa > genome.fa
sbatch tigmint.sh
# from tapou
mazytnicki@tapou:~/Projects/Other/Assembly/tenxchecker$ scp mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/Offspring/genome.tigmint.arcs.fa /Scratch/mazytnicki/Offspring_2/

Quast

~/Desktop/Apps/quast-5.0.2/quast-lg.py -t 15 -r /Scratch/mazytnicki/Offspring_2/assembly.pilon1.fa -t 6 -o /home/mazytnicki/Desktop/Projects/Other/Assembly/Quast/Offspring sequences.fa /Scratch/mazytnicki/Offspring_2/genome.tigmint.arcs.fa

Inspector ```{bash} scp sequencesSplit.fa mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/Offspring/ scp sequences.fa mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/Offspring/sequencesMs.fa

From genologin

cd /home/mzytnicki/work/Projects/tenxChecker/ sbatch inspector_raw.sh sbatch inspector_tigmint.sh sbatch inspector_tigmint0.sh sbatch inspector_mss.sh

scp mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/Offspring/genome.tigmint.arcs.fa /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references mv /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/tmp sed '/^>/s/ .*//' /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/tmp | fold > /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa ~/Apps/samtools-1.10/samtools faidx /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa cut -f 1-2 /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa.fai > /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.size ~/Apps/bwa-0.7.17/bwa index /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa cd /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/restriction_sites python3 /home/mazytnicki/Apps/juicer/misc/generate_site_positions.py HindIII genome.tigmint.arcs.fa /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa

Add bwa, samtools, & gawk to the PATH

PATH=~/bin:$PATH mkdir splits ln -s fastq/* splits/ /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/scripts/juicer.sh -d /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/ -z /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.fa -p /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/references/genome.tigmint.arcs.size -y /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/restriction_sites/genome.tigmint.arcs.fa_HindIII.txt -D /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/ -t 6 -g bt37160 &> /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/juicer.out samtools view -O SAM -F 1024 /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/aligned/merged_dedup.bam | awk -v mnd=1 -f /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/scripts/common/sam_to_pre.awk > /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/aligned/merged_nodups.txt xz -T 5 /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/aligned/merged_nodups.txt scp /Scratch/mazytnicki/Offspring_2/Juicer/Tigmint/aligned/merged_nodups.txt.xz mzytnicki@genologin.toulouse.inrae.fr:/home/mzytnicki/work/Projects/tenxChecker/Offspring/



mzytnicki/msscaf documentation built on Oct. 9, 2022, 8:08 p.m.