knitr::opts_chunk$set(collapse = T, comment = "#>", fig.width=6)
options(tibble.print_min = 4L, tibble.print_max = 4L)
set.seed(123)
library(TRALResultAnalysis)

Introduction

[TODO: Extend and reformulate]

NF-κB is a signaling pathway that takes part in cell proliferation and inflammation mechanisms. It consists of five subunits acting as transcription factors, RelA/p65, c-Rel, RelB, p50/NF-κB1 and p52/NF-κB2, that are able to dimerize and are sequestered in the cytoplasm by Iκb proteins. The IKK complex, consisting of two catalytic (IKKα and IKKβ) and one regulatory (IKKγ) subunits, represents the major regulator of this pathway. It acts by phosphorylating Iκb proteins targeting them for proteasomal degradation. Iκb degradation releases NF-κB proteins in the cytoplasm. Thus, they are free to translocate into the nucleus and activate transcription of specific genes (41,42). [TODO][Rosa2015]

[Rosa2015][https://doi.org/10.3892/or.2015.4108]

Methods

Data Sources and Collection

All protein sequences were retrieved via the REST API from UniProt/Swiss-Prot Knowledgebase 23 release 2019_04 for the reviewed entries of the Homo Sapiens proteome (UP000005640).

Protein Sequence Origin

The names of CRC favorable and unfavorable genes were downloaded manually (Mai 2019) from the proteinatlas website [24][proteinatlasCRC] and saved as .tsv. The sequences of proteins which are expressed from CRC favorable and unfavorable genes, were retreived from Swiss-Prot by querying all proteins from each gene. Next, we checked the .fasta file for duplicated protein IDs, which requiered manual inspection.

Analogously, the protein sequences for all proteins which are related to the Wnt-pathway in Swiss-Prot were retrieved and filtered for duplicates.

Tandem Repeat Detection \& Filtering

For the detection and filtering of TRs, we applied the open-source Python 3 Tandem Repeat Annotation Library (TRAL) 25. De novo detection of TRs in protein sequences was performed applying the detectors HHrepID, T-REKS, TRUST and XSTREAM followed by a refinement with profile Hidden Markov Models (HMM) implemented with HMMER 26 to capture most of the TR units and their boundaries. The TRs, were then filtered on a significance level of $\alpha = 0.05$ for a Likelyhood Ratio Test under the null hypothesis of random TR sequence evolution, a TR unit divergence of $d_{TR_{units}} < 0.1$ specifying the number of substitution site dating back to the most recent TR unit ancestor, a TR unit length of $l \leq 1000$ specifying the number of noninsertion sites of the TR unit and a TR unit number of $n \geq 2.5$ specifying the number of noninsertion sites in the TR-MSA divided by $l$ [27][Schaper2012].

Processing TRAL Results for Data Mining in R

The detected TR characteristics and their MSA are stored for each protein in a single .tsv and binary .pickle file. For further analysis, we join the results for all proteins in one file for each group (unfavorable, favorable, Wnt-pathway) containing only proteins which have TRs.

The TR detection and annotation workflow is implemented in Python 3 and freely available on Github. For reproduction of the reported results, change the code in order to use the provided datasets. To obtain the most recent datasets, just adapt main.py regarding your systems specification before sourcing.

TR Characterization in CRC Associated Proteins

To provide a reproducible framework of the TR characterization the following results including the data of the previous section is implemented as an R-package on github which can be installed like a regular R-package. Sourcing the vignette GeneralOverviewofTandemRepeats.Rmd allows an exact reproduction of the reported results.

The evaluation was performed using

sessionInfo()
base_path <- "/home/delt/ZHAW/TRALResultAnalysis/"
library(TRALResultAnalysis)
tr_crcfavorable_path <- paste0(base_path, "inst/extdata/TRs_favorable_proteins_CRC_sp_l1000.tsv")
tr_crcunfavorable_path <- paste0(base_path, "inst/extdata/TRs_unfavorable_proteins_CRC_sp_l1000.tsv")
tr_nfkappab_path <- paste0(base_path, "inst/extdata/TRs_nfkappab_proteins_CRC_sp_l1000.tsv")
dest_file_sp <- paste0(base_path, "data/swissprot_human.tsv")
dest_file_kin <- paste0(base_path, "data/swissprot_human_kinome.tsv")
tr_fav <- load_tr_annotations(tr_crcfavorable_path)
tr_unfav <- load_tr_annotations(tr_crcunfavorable_path)
tr_nfkappab <- load_tr_annotations(tr_nfkappab_path)

To get deeper insights about the proteins containing TRs, more information is added from Swiss-Prot by a left join

sp_url <- "https://www.uniprot.org/uniprot/?query=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20reviewed:yes&format=tab&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length,virus%20hosts,encodedon,database(Pfam),interactor,comment(ABSORPTION),feature(ACTIVE%20SITE),comment(ACTIVITY%20REGULATION),feature(BINDING%20SITE),feature(CALCIUM%20BIND),comment(CATALYTIC%20ACTIVITY),comment(COFACTOR),feature(DNA%20BINDING),ec,comment(FUNCTION),comment(KINETICS),feature(METAL%20BINDING),feature(NP%20BIND),comment(PATHWAY),comment(PH%20DEPENDENCE),comment(REDOX%20POTENTIAL),rhea-id,feature(SITE),comment(TEMPERATURE%20DEPENDENCE)&sort=organism"

# Download the swissprot file only if it doesn't already exist.
# Uncomment this, if you want to use the most recent available data!
if(!file.exists(dest_file_sp)){
    download.file(sp_url, destfile = dest_file_sp)
}

sp_all_fav <- load_swissprot(dest_file_sp, tr_fav)
sp_all_unfav <- load_swissprot(dest_file_sp, tr_unfav)
sp_all_nfkappab <- load_swissprot(dest_file_sp, tr_nfkappab)

tr_unfav_sp <- merge(x = tr_unfav, y = sp_all_unfav, by = "ID", all.x = TRUE)
tr_fav_sp <- merge(x = tr_fav, y = sp_all_fav, by = "ID", all.x = TRUE)
tr_nfkappab_sp <- merge(x = tr_nfkappab, y = sp_all_nfkappab, by = "ID", all.x = TRUE)

Results \& Discussion

In 243 genes associated with unfavorable CRC expressing 286 proteins we could de novo detect 572 TRs of which after filtering and clustering 67 TRs remained. In CRC favorable genes (352) 403 proteins are expressed, where we could de novo detect 806 TRs with 42 TRs beeing left after filtering and clustering.

Additionally, 604 proteins were associated in Swiss-Prot with the NF-$\kappa$-B-pathway, where we could de novo detect 1208 TRs of which 100 TRs remained after filtering and clustering.

More TRs Were Detected in Proteins Associated With Unfavorable CRC Prognosis

# CRC favorable
length(unique(tr_fav_sp$ID))/403 # through unique() we ensure to count those proteins with >1 TR only once.

# CRC unfavorable
length(unique(tr_unfav_sp$ID))/286 

# Wnt pathway
length(unique(tr_nfkappab_sp$ID))
length(unique(tr_nfkappab_sp$ID))/604

Of all proteins expressed by genes associated for beeing favorable or unfavorable for CRC, 10% and 23% contain at least one TR respectively. Analogously, we could detect TRs in 100 (17%) of the 604 proteins involved in the Human NF-$\kappa$-B-pathway.

AA bias in TRs: AAs compared in TRs to the whole protein sequence

# combine all TR from the three groups
tr_all <- rbind(tr_fav, tr_unfav, tr_nfkappab)

AAfreq_all <- AAfreq_in_TR(tr_all) 
AAfreq_all[base::order(AAfreq_all$aa_ratio, decreasing = TRUE),]

We combine all datasets and count the amino acid frequency. We can see, that Alanine is most frequent (13%) followed by Proline (11%) and Serine (11%) in CRC and NF-$\kappa$-B pathway associated proteins. This is analogous to the results from the Wnt-Pathway.

# (AAfreq_fav <- AAfreq_in_TR(tr_fav))
# (AAfreq_unfav <- AAfreq_in_TR(tr_unfav))
# cor.test(AAfreq_fav$aa_ratio, AAfreq_unfav$aa_ratio, method = "spearman")
AAratio_vs_Disorderpropensity(tr_all, plot_title = "NF-kappa-B-Pathway Associated Proteins")

Plotting the amino acid frequency ratio of TRs in CRC and NF-$\kappa$-B pathway associated proteins against the amino acid disorderpropensity we can see that more amino acids with high disorder propensity are part of TR in CRC and NF-$\kappa$-B pathway associated proteins - with exception of Lysine.

AAratio_vs_Disorderpropensity(sp_overall = TRUE)

This trend is consistent with the overall amino acid frequency of all known proteins - amino acids with higher disorderpropensity appear in TRs significantly more frequent.

We now compare the AA frequency in from the TR sequence to the whole protein sequence in proteins from the NF-$\kappa$-B pathway.

AAfreq_whole_prot_nfkappab <- AAfreq_in_fasta(path = "/home/delt/ZHAW/CRC_TRs/data/nfkappab_proteins_CRC_sp.fasta")
AAfreq_TR_nfkappab <- AAfreq_in_TR(tr_nfkappab)

AAfreq <- AAfreq_in_TR(tr_nfkappab)
AAfreq$aa_ratio_overall <- AAfreq_whole_prot_nfkappab$aa_ratio

plot_AAratio_comparison(df = AAfreq, plot_title = "Whole Protein Sequence AA Frequency in NF-kappa-B", ylim = 0.2)
p1 <- plot_AAratio(df = AAfreq_whole_prot_nfkappab, plot_title = "Whole Protein Sequence AA Frequency in NF-kappa-B", ylim = 0.2)
p2 <- plot_AAratio(df = AAfreq_in_TR(tr_nfkappab), plot_title = "Tandem Repeat AA Frequency in NF-kappa-B", ylim = 0.2)
grid.arrange(p1, p2, nrow=2)

We can see, that the overall ratio of AAs in TRs (green) and the whole protein sequences (red) is showing similar trends. However, certain AAs appear more frequently in TRs compared to the rest of the proteinsequence, i.e. A, Q, I, M, S, W. They all have a generally high disorder propensity supporting the theory that TRs are enriched with disorder promoting residues and hence appear often within disordered regions.

TR length distribution in the NF-$\kappa$-B Pathway

p1 <- ggplot(data = tr_nfkappab, aes(x = l_effective))+
  geom_histogram(binwidth = 1,)+
  labs(x = "TR unit length",
       title = "TR length distribution in NF-kappa-B Pathway Proteins")

p1 <- beautifier(p1)+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))+
  scale_x_continuous(breaks=seq(0, 80, by = 5))
p1

Most of the TRs are homoTRs or microTR with 2 AA per unit.

Let's examine the longer ones.

tr_nfkappab_sp[which(tr_nfkappab_sp$l_effective >5),c(1,4,5,14)]

We find the expected Ubiquitin, Mucin and Zn-finger proteins. But also the Adipocyte enhancer-binding protein 1. [TODO: look closer at adipocyte enhancer binding protein 1]

Next comes a nice plot combining the information from above in one plot.

plot_TRunitnoTRunitlength(tr_sp = tr_nfkappab_sp, ngeq = 15, lgeq = 5)

Interesting TRs

Summarising uniprot functions with a focus on CRC.

NCOA6

Interacts with NFKB1 subunit. Possibly involved in NF-$\kappa$-B activation.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'NCOA6_HUMAN'),c(1,31)]

AEBP1

Might positively regulate NF-kappa-B activity in macrophages by promoting the phosphorylation and subsequent degradation of I-kappa-B-alpha (NFKBIA), leading to enhanced macrophage inflammatory responsiveness.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'AEBP1_HUMAN'),c(1,31)]

TAOK

Isoform 1 (there are two in total) plays a role in morphological changes. It also binds ot microtubules and affects their organization and stability independently of its kinase activity. Prevents MAP3K7-mediated activation of CHUK and zhus NF-kappa-B activation but not that of MAPK8/JNK.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'TAOK2_HUMAN'),c(1,31)]

PEG3

Induces apoptosis in cooperation with SIAH1A. Acts as a mediator between p53/TP53 and BAX in a neuronal death pathway that is activated by DNA damage. Acts synergistically with TRAF2 and inhibits TNF induced apoptosis through activation of NF-kappa-B (By similarity). Possesses a tumor suppressing activity in glioma cells. {ECO:0000250, ECO:0000269|PubMed:11260267}.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'PEG3_HUMAN'),c(1,31)]

CBP

Transcriptional coactivator for SMAD4 in the TGF-beta signaling pathway. Acetylates histones, gining a specific tag for transcriptional activation.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'CBP_HUMAN'),c(1,31)]

NUMBL

Negative regulator of NF-kappa-B signaling pathway. The inhibition of NF-kappa-B activation is mediated at least in part, by preventing MAP3K7IP2 to interact with polyubiquitin chains of TRAF6 and RIPK1 and by stimulating the 'Lys-48'-linked polyubiquitination and degradation of TRAF6 in cortical neurons.

tr_nfkappab_sp[which(tr_nfkappab_sp$prot_name == 'NUMBL_HUMAN'),c(1,31)]

When we look more close to the TRs with many ($\ge 8 $) units, we see the following:

tr_nfkappab_sp[which(tr_nfkappab_sp$n_effective >= 8),c(1,31)]

They are all homo repeats with Ubiquitin and Mucin as exceptions. [TODO: investigate their functions]

NF-kappa-B in CRC associated proteins

(tr_fav_sp[which(tr_fav_sp$ID %in% tr_nfkappab_sp$ID),])

Two Proteins

nrow(tr_unfav[which(tr_unfav$ID %in% tr_nfkappab$ID),])


matteodelucchi/TRAL-Result-Analysis documentation built on Dec. 2, 2019, 11:42 p.m.