knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 6)
options(tibble.print_min = 4L, tibble.print_max = 4L)
set.seed(123)

Variations in tandem repeats are known to be involved in the development of colorectal cancer, which is one of the leading causes of cancer-related mortality. To improve the understanding of tandemly repeated sequences in proteins, we perform a functional analysis of colorectal cancer associated proteins and members of the Wnt/$\beta$-catenin pathway [TODO: which is one of the driver pathways in CRC/upregulated in CRC]. Our results support previous findings that tandem repeats are involved in protein-protein interactions, and we identify new candidate proteins in which tandem repeat disorders may contribute to colorectal cancer. This lays a foundation for a systems-wide functional analysis of colorectal cancer associated pathways, with the aim of providing a full list of potential proteomic driver tandem repeats.

Introduction

In Switzerland, colorectal cancer is among the top three causes of cancer-related mortality 1. Because the disease shows varying clinical outcomes and therapeutic responses, researchers use personalized medicine to find the optimal treatment for each patient.

The development of colorectal cancer (CRC) is accompanied by genetic and epigenetic alterations in oncogenes and tumor-suppressor genes (unfavorable and favorable CRC genes, respectively), which are thought to drive CRC onset, progression and metastasis 2 3.

Some clinically relevant biomarkers are already used today as prognostic and/or therapy-predictive markers 4.

Colorectal Cancer on a Molecular Level

Undifferentiated cells, such as colon stem progenitor cells, are located at the bottom of the crypts in a healthy colon. They are capable of self-renewal and are multipotent. During their life-cycle of about 14 days, they differentiate into all epithelial colon cell lineages (Paneth, goblet, enterocytes and enteroendocrine) while migrating from the crypt to the top of the villus, where they undergo apoptosis 5. The maintenance of genomic integrity is essential for proper cell differentiation in the colon. If it is lost, multiple mutations can readily accumulate through chromosomal instability 6, microsatellite instability 7, aberrant DNA methylation 8 and/or DNA repair defects, leading to CRC.

Several altered molecular signaling pathways are involved in CRC onset, such as Wnt/APC/$\beta$-catenin, phosphoinositide 3-kinase (PI3K)/AKT/glycogen synthase kinase-3$\beta$ (GSK-3$\beta$), transforming growth factor-$\beta$ (TGF-$\beta$)/Smad, NF-$\kappa$B or mismatch repair (MMR) genes 9.

Wnt-signaling plays a key role in early cell fate decision in embryonic stem cells and is therefore under extensive investigation in cancer research. For example, the Wnt target gene LGR5 is a multipotent stem cell marker in crypts. ASCL2, SOX9 and Paneth cell defensins are other examples of Wnt target genes controlling various steps in the differentiation process of stem cells.

Wnt / $\beta$-catenin pathway

Upon secretion, the Wnt ligand binds to Frizzled (Fz) receptors. This results in the inactivation of the multifunctional kinase GSK-3$\beta$ and in the stabilisation of $\beta$-catenin. $\beta$-catenin is both an E-cadherin-associated cell-cell adhesion protein and a transcriptional activator. Once stabilized, it accumulates in the cytoplasm and is translocated into the nucleus. There it interacts with members of the lymphoid enhancer factor (LEF)/T-cell factor (TCF) family and activates specific target genes. In the absence of a Wnt signal, $\beta$-catenin is degraded, which prevents its nuclear translocation. The phosphorylation and ubiquitination of $\beta$-catenin are controlled by casein kinase 1 (CK1) and the APC/Axin/GSK-3$\beta$ complex.

Activation of the Wnt pathway through mutations in the APC gene is known to have a leading role in CRC pathogenesis. The protein product of APC is among the key components of the $\beta$-catenin destruction complex; its loss leads to an accumulation of $\beta$-catenin, which is then translocated to the nucleus where it mediates gene transcription through interaction with the LEF/TCF transcription factor family (LEF-1/LEF1, TCF-1/TCF7, TCF-3/TCF7L1, and TCF-4/TCF7L2) 10. The same outcome is expected, although less frequently, from mutations in $\beta$-catenin and Axin2 11 12.

An increased concentration of $\beta$-catenin in the cytoplasm promotes cellular proliferation. Nuclear expression of $\beta$-catenin can be identified immunohistochemically but is not unique to CRC. However, it is one of the components of a diagnostic panel 13.

Wnt-Signaling and Colon Cancer

In different cancer types the $\beta$-catenin dependent pathway is regulated differently in cancerous tissue than in normal tissue and often shows $\beta$-catenin stabilizing mutations 14 15, resulting in enhanced Wnt-signaling. Among the many components of the pathway, APC and $\beta$-catenin itself play crucial roles in Wnt-signaling and cancer.

Mouse models with a Wnt-pathway activated through APC mutations revealed that mutations in APC alone are not sufficient for carcinogenic development. The Apc^min/+^ mouse models showed non-invasive intestinal tumors 16. For the development of an unfavorable carcinoma, an increased activation of proto-oncogenic signaling pathways (e.g. K-ras) and additional mutations in TGFBRII, SMAD4 and TP53 may be necessary 17.

Further, LEF/TCF transcription factors play a crucial role in CRC and may offer therapeutic opportunities 18. The recruitment of $\beta$-catenin to target genes in Wnt-signaling depends on a $\beta$-catenin binding domain at the N-terminus of the LEF/TCF transcription factor. At the C-terminus, a sequence-specific DNA binding domain recognizes Wnt response elements in target genes. Interestingly, the different LEF/TCF loci produce different protein isoforms through alternative splicing and alternative promoters, which influence Wnt-signaling and can lead to cancer development 19 20. The expression pattern of the LEF/TCF loci differs between normal and cancerous human colon tissue. In healthy colon cells, TCF-1 and TCF-4 are expressed, while LEF-1 and TCF-3 are silent. Mouse models showed that a loss of TCF-4 depletes stem cells and eliminates cell proliferation in the intestine 21. TCF-1 is suggested to be a feedback repressor of TCF-4 and to cooperate with APC to suppress CRC. In cancerous colon cells, TCF-1 and TCF-4 are still expressed (even though now as a full-length form), but LEF-1 is also produced, which results in maximal $\beta$-catenin interaction and hence in oncogenic Wnt-signaling.

Short Tandem Repeats in Wnt-Pathway

We previously showed 22 that TRs in proteins cluster at the flanks of protein sequences and are involved in protein-protein interactions as well as in transcription processes and DNA-binding. We therefore investigate whether TRs can be detected with the state-of-the-art classification method 23 in proteins generally associated with favorable and unfavorable CRC prognosis 24 and in Wnt/$\beta$-catenin pathway related proteins, with a focus on LEF/TCF transcription factors.

Methods

Data Sources and Collection

All protein sequences were retrieved via the REST API from the UniProt/Swiss-Prot Knowledgebase 23, release 2019_04, for the reviewed entries of the Homo sapiens proteome (UP000005640).

Protein Sequence Origin

The names of CRC favorable and unfavorable genes were downloaded manually (May 2019) from the proteinatlas website 24 and saved as .tsv. The sequences of the proteins expressed from CRC favorable and unfavorable genes were retrieved from Swiss-Prot by querying all proteins for each gene. Next, we checked the .fasta file for duplicated protein IDs, which required manual inspection.

Analogously, the protein sequences for all proteins which are related to the Wnt-pathway in Swiss-Prot were retrieved and filtered for duplicates.

Tandem Repeat Detection \& Filtering

For the detection and filtering of TRs, we applied the open-source Python 3 Tandem Repeat Annotation Library (TRAL) 25. De novo detection of TRs in protein sequences was performed with the detectors HHrepID, T-REKS, TRUST and XSTREAM, followed by a refinement with profile Hidden Markov Models (HMM) implemented with HMMER 26 to capture most of the TR units and their boundaries. The TRs were then filtered by four criteria 27: a significance level of $\alpha = 0.05$ for a Likelihood Ratio Test under the null hypothesis of random TR sequence evolution; a TR unit divergence of $d_{TR_{units}} < 0.1$, where the divergence measures the number of substitutions per site since the most recent common ancestor of the TR units; a TR unit length of $l \leq 1000$, where $l$ is the number of non-insertion sites of the TR unit; and a TR unit number of $n \geq 2.5$, where $n$ is the number of non-insertion sites in the TR multiple sequence alignment (TR-MSA) divided by $l$.
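The four filter criteria above can be written compactly as a row filter on a table of detected TRs. The following is a minimal R sketch for illustration only; in the workflow described here the filtering is done with TRAL in Python. The column names `pvalue` and `divergence` are assumptions, while `l_effective` and `n_effective` follow the column names used later in this document.

```r
# Minimal sketch of the four filter criteria.
# `pvalue` and `divergence` are hypothetical column names.
filter_trs <- function(tr, alpha = 0.05, max_div = 0.1,
                       max_l = 1000, min_n = 2.5) {
  keep <- tr$pvalue < alpha &
    tr$divergence < max_div &
    tr$l_effective <= max_l &
    tr$n_effective >= min_n
  tr[which(keep), , drop = FALSE]
}

# Toy input (made-up values): only the first row passes all four criteria.
toy_tr <- data.frame(
  ID          = c("P1", "P2", "P3"),
  pvalue      = c(0.01, 0.20, 0.01),
  divergence  = c(0.05, 0.05, 0.30),
  l_effective = c(3, 3, 12),
  n_effective = c(4, 4, 2.5)
)
filtered_tr <- filter_trs(toy_tr)
```

Using `which()` drops rows for which a criterion evaluates to `NA` instead of returning rows filled with `NA`.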

Processing TRAL Results for Data Mining in R

The detected TR characteristics and their MSA are stored for each protein in a single .tsv and binary .pickle file. For further analysis, we join the results for all proteins in one file for each group (unfavorable, favorable, Wnt-pathway) containing only proteins which have TRs.
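Joining the per-protein results into one table per group could look roughly like the sketch below. It assumes a hypothetical layout with one `<ID>.tsv` file per protein in a single directory; the function name `combine_tr_tables` and this layout are illustrative and not taken from the published workflow.

```r
# Sketch: combine per-protein result tables into one table per group.
# Assumes one "<ID>.tsv" file per protein (hypothetical layout).
combine_tr_tables <- function(dir) {
  files <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    if (file.size(f) == 0) return(NULL)        # empty file: nothing to read
    tab <- read.delim(f, stringsAsFactors = FALSE)
    if (nrow(tab) == 0) return(NULL)           # protein without TRs
    tab$ID <- sub("\\.tsv$", "", basename(f))  # protein ID taken from file name
    tab
  })
  tables <- Filter(Negate(is.null), tables)    # keep only proteins with TRs
  do.call(rbind, tables)
}
```

Proteins without any TR are skipped, so the combined table contains only proteins which have TRs, as described above.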

The TR detection and annotation workflow is implemented in Python 3 and freely available on GitHub. To reproduce the reported results, adapt the code so that it uses the provided datasets. To obtain the most recent datasets, adapt main.py to your system's specification before running it.

TR Characterization in CRC Associated Proteins

To provide a reproducible framework for the TR characterization, the following analysis, including the data from the previous section, is implemented as an R package on GitHub which can be installed like a regular R package. Sourcing the vignette GeneralOverviewofTandemRepeats.Rmd allows an exact reproduction of the reported results.

The evaluation was performed using

sessionInfo()
base_path <- "/home/delt/ZHAW/TRALResultAnalysis/"
library(TRALResultAnalysis)
tr_crcfavorable_path <- paste0(base_path, "inst/extdata/TRs_favorable_proteins_CRC_sp_l1000.tsv")
tr_crcunfavorable_path <- paste0(base_path, "inst/extdata/TRs_unfavorable_proteins_CRC_sp_l1000.tsv")
tr_wnt_path <- paste0(base_path, "inst/extdata/TRs_Wnt_proteins_CRC_sp_l1000.tsv")
dest_file_sp <- paste0(base_path, "data/swissprot_human.tsv")
dest_file_kin <- paste0(base_path, "data/swissprot_human_kinome.tsv")
tr_fav <- load_tr_annotations(tr_crcfavorable_path)
tr_unfav <- load_tr_annotations(tr_crcunfavorable_path)
tr_wnt <- load_tr_annotations(tr_wnt_path)

To gain deeper insight into the proteins containing TRs, additional information from Swiss-Prot is added by a left join.

sp_url <- "https://www.uniprot.org/uniprot/?query=organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20reviewed:yes&format=tab&columns=id,entry%20name,reviewed,protein%20names,genes,organism,length,virus%20hosts,encodedon,database(Pfam),interactor,comment(ABSORPTION),feature(ACTIVE%20SITE),comment(ACTIVITY%20REGULATION),feature(BINDING%20SITE),feature(CALCIUM%20BIND),comment(CATALYTIC%20ACTIVITY),comment(COFACTOR),feature(DNA%20BINDING),ec,comment(FUNCTION),comment(KINETICS),feature(METAL%20BINDING),feature(NP%20BIND),comment(PATHWAY),comment(PH%20DEPENDENCE),comment(REDOX%20POTENTIAL),rhea-id,feature(SITE),comment(TEMPERATURE%20DEPENDENCE)&sort=organism"

# Download the Swiss-Prot file only if it doesn't already exist.
# Delete the local file first if you want to use the most recent available data.
if(!file.exists(dest_file_sp)){
    download.file(sp_url, destfile = dest_file_sp)
}

sp_all_fav <- load_swissprot(dest_file_sp, tr_fav)
sp_all_unfav <- load_swissprot(dest_file_sp, tr_unfav)
sp_all_wnt <- load_swissprot(dest_file_sp, tr_wnt)

tr_unfav_sp <- merge(x = tr_unfav, y = sp_all_unfav, by = "ID", all.x = TRUE)
tr_fav_sp <- merge(x = tr_fav, y = sp_all_fav, by = "ID", all.x = TRUE)
tr_wnt_sp <- merge(x = tr_wnt, y = sp_all_wnt, by = "ID", all.x = TRUE)
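The effect of the left join (`all.x = TRUE`) used above can be seen on a toy example. The tables and IDs below are made up for illustration.

```r
# Toy tables; IDs and values are made up for illustration.
toy_tr <- data.frame(ID = c("A", "A", "B"), begin = c(5, 40, 12))
toy_sp <- data.frame(ID = c("A", "C"),
                     protein_name = c("Protein A", "Protein C"))

# all.x = TRUE keeps every TR row, even if there is no matching annotation.
# Annotation-only entries (here "C") are dropped.
toy_joined <- merge(x = toy_tr, y = toy_sp, by = "ID", all.x = TRUE)
```

Proteins with several TRs keep one row per TR, and TRs without a matching annotation get `NA` in the annotation columns.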

Results \& Discussion

In the 286 proteins expressed from 243 genes associated with unfavorable CRC, we could de novo detect 572 TRs, of which 67 TRs remained after filtering and clustering. The 352 CRC favorable genes express 403 proteins, in which we could de novo detect 806 TRs, with 42 TRs being left after filtering and clustering.

Additionally, 644 proteins were associated in Swiss-Prot with the Wnt-pathway, where we could de novo detect 2768 TRs of which 268 TRs remained after filtering and clustering.

More TRs Were Detected in Proteins Associated With Unfavorable CRC Prognosis

# CRC favorable
length(unique(tr_fav_sp$ID))/403 # through unique() we ensure to count those proteins with >1 TR only once.

# CRC unfavorable
length(unique(tr_unfav_sp$ID))/286 

# Wnt pathway
length(unique(tr_wnt_sp$ID))
length(unique(tr_wnt_sp$ID))/644 

Of all proteins expressed by genes associated with favorable or unfavorable CRC prognosis, 10% and 23%, respectively, contain at least one TR. Analogously, we could detect TRs in 268 (42%) of the 644 proteins involved in the human Wnt-pathway.

Homo-TR Cluster Towards the N-Terminus

TRs generally cluster to the flanks of proteins with a tendency of domain-TRs to cluster near the N-terminus TODO.

TR_location(
  rbind(tr_fav_sp, tr_unfav_sp),
  plot_title = "CRC Favorable & Unfavorable Proteins",
  byTRtype = TRUE)

In proteins expressed from CRC favorable and unfavorable genes, homo-TRs show a tendency for N-terminal location. Many of the homo-TRs fall in the region allocated to signal sequences, which tag the proteins for extracellular translocation through the cell membrane.

TR_location(tr_fav_sp, 
            plot_title = "CRC Favorable Proteins",
            byTRtype = TRUE)
TR_location(tr_unfav_sp,   
            plot_title = "CRC Unfavorable Proteins", 
            byTRtype = TRUE)

Looking separately at CRC favorable and unfavorable proteins, it can be observed that favorable CRC proteins have a greater tendency for N-terminal location of their homo-TRs. This allows us to hypothesize that favorable CRC proteins may have a signal sequence and play extracellular roles more frequently than CRC unfavorable proteins.
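One simple way to quantify the location of a TR within its protein is the relative position of the TR centre. The sketch below is not the implementation of `TR_location()`; the function `tr_relative_position` and the argument `prot_length` are hypothetical, while `begin` and `repeat_region_length` follow the column names used in this document.

```r
# Sketch: relative position of the TR centre within the protein
# (close to 0 = N-terminus, close to 1 = C-terminus).
# `prot_length` is a hypothetical name for the protein sequence length.
tr_relative_position <- function(begin, repeat_region_length, prot_length) {
  center <- begin + (repeat_region_length - 1) / 2  # 1-based centre of the region
  center / prot_length
}

# Toy values, made up for illustration.
rel_pos <- tr_relative_position(begin = c(3, 450),
                                repeat_region_length = c(9, 21),
                                prot_length = c(500, 500))

# Label each TR by the third of the sequence its centre falls into.
region <- cut(rel_pos, breaks = c(0, 1/3, 2/3, 1),
              labels = c("N-terminal", "central", "C-terminal"),
              include.lowest = TRUE)
```

Normalising by protein length makes TR positions comparable between proteins of different size.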

It could be shown TODO that N-terminal signal peptides are rich in Leucine residues.

# combine all TR from the three groups
tr_all <- rbind(tr_fav, tr_unfav, tr_wnt)

AAfreq_all <- AAfreq_in_TR(tr_all) 
AAfreq_all[base::order(AAfreq_all$aa_ratio, decreasing = TRUE),]

AAfreq_all_homo <- AAfreq_in_TR(tr_all[which(tr_all$l_type == "homo"),]) 
AAfreq_all_homo[base::order(AAfreq_all_homo$aa_ratio, decreasing = TRUE),]

AAfreq_all[which(AAfreq_all$aa == "L"),]
AAfreq_all_homo[which(AAfreq_all_homo$aa == "L"),]

We combine all datasets, count the amino acid frequencies and compare them to the amino acid frequencies in homo-TRs; for Leucine the two do not differ much. However, we can see that Alanine is most frequent (14%), followed by Proline (12%) and Serine (12%), in CRC and Wnt/$\beta$-catenin pathway associated proteins.
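To make the counting step concrete, the sketch below computes relative amino acid frequencies from TR alignments. It is not the implementation of `AAfreq_in_TR()`; it assumes that each alignment is stored as one string with TR units separated by "," and gaps written as "-", which is an assumption about the `msa_original` format. The output columns `aa` and `aa_ratio` mirror the names used in the chunk above.

```r
# Sketch: relative amino acid frequencies over a set of TR alignments.
# Assumed format: units separated by ",", gaps written as "-".
aa_frequency <- function(msa) {
  residues <- unlist(strsplit(gsub("[,-]", "", msa), split = ""))
  counts <- table(residues)
  freq <- data.frame(aa = names(counts),
                     aa_ratio = as.numeric(counts) / sum(counts),
                     stringsAsFactors = FALSE)
  freq[order(freq$aa_ratio, decreasing = TRUE), ]
}

# Toy alignments (made up): a polyL homo-TR and a micro-TR with one gap.
toy_freq <- aa_frequency(c("L,L,L,L", "PA,PA,P-"))
```

Gap characters and unit separators are removed before counting, so the ratios sum to one over the observed residues.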

# (AAfreq_fav <- AAfreq_in_TR(tr_fav))
# (AAfreq_unfav <- AAfreq_in_TR(tr_unfav))
# cor.test(AAfreq_fav$aa_ratio, AAfreq_unfav$aa_ratio, method = "spearman")
AAratio_vs_Disorderpropensity(tr_all, plot_title = "CRC & Wnt-Pathway Associated Proteins")

Plotting the amino acid frequency ratio of TRs in CRC and Wnt/$\beta$-catenin pathway associated proteins against the amino acid disorder propensity, we can see that amino acids with high disorder propensity are more often part of TRs in CRC and Wnt/$\beta$-catenin pathway associated proteins, with the exception of Lysine.

AAratio_vs_Disorderpropensity(sp_overall = TRUE)

This trend is consistent with the overall amino acid frequency of all known proteins: amino acids with higher disorder propensity appear in TRs significantly more frequently.

Only Few Wnt-pathway Proteins Are Associated With CRC.

Proteins with TRs from the Wnt-pathway seem to be largely distinct from the proteins retrieved via the CRC favorable and unfavorable genes.

# CRC favorable proteins in Wnt pathway
tr_fav_sp$protein_name[which(tr_fav_sp$ID %in% tr_wnt_sp$ID)]

5 Proteins which are associated with favorable CRC prognosis are involved in the Wnt-pathway and

# CRC unfavorable proteins in Wnt pathway
tr_unfav_sp$protein_name[which(tr_unfav_sp$ID %in% tr_wnt_sp$ID)]

7 with unfavorable prognosis.

Nucleic Acid Binding Proteins Show Many Repeat Regions

The number of detected TRs exceeds the number of analysed proteins which suggests that there are proteins with more than one TR-region.

# CRC favorable proteins
table(table(tr_fav$ID))

# CRC unfavorable proteins
table(table(tr_unfav_sp$ID))

# Wnt-pathway proteins
table(table(tr_wnt_sp$ID))

Although most of the proteins contain only one TR-region, we detect up to 11 different TR-regions, e.g. in some proteins associated with unfavorable CRC prognosis.
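The nested `table(table(...))` calls used above first count the TR rows per protein and then count how many proteins share each TR count. A toy example with made-up IDs:

```r
# Toy IDs: P1 has three TR-regions, P2 and P3 have one each.
toy_ids <- c("P1", "P1", "P1", "P2", "P3")

trs_per_protein <- table(toy_ids)            # number of TR rows per protein
tr_count_distribution <- table(trs_per_protein)  # number of proteins per TR count
```

Here `tr_count_distribution` reports two proteins with one TR-region and one protein with three TR-regions.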

Among the proteins associated with unfavorable CRC prognosis we found in general more proteins with $\geq 4$ TRs than in the group of favorable or Wnt-pathway proteins.

protein_id_by_number_of_TR(tr_unfav_sp, 11)
eleven_TRs <- tr_unfav_sp[which(tr_unfav_sp$ID == protein_id_by_number_of_TR(tr_unfav_sp, 11)[[1]]),]
unique(eleven_TRs$protein_name)
unique(eleven_TRs$prot_function)

For example the splicing factor protein SFR19_HUMAN (ID: Q9H7N4) shows 11 TR-regions.

protein_id_by_number_of_TR(tr_unfav_sp, 5)
five_TRs <- tr_unfav_sp[which(tr_unfav_sp$ID == protein_id_by_number_of_TR(tr_unfav_sp, 5)[1]),]
unique(five_TRs$protein_name)
unique(five_TRs$prot_function)
five_TRs <- tr_unfav_sp[which(tr_unfav_sp$ID == protein_id_by_number_of_TR(tr_unfav_sp, 5)[2]),]
unique(five_TRs$protein_name)
unique(five_TRs$prot_function)

Forkhead-related transcription factor 3 (ID: Q12948) and the CLK4 splicing factor each appear to have 5 TR-regions; the former promotes cell growth inhibition and is involved in cell migration.

[TODO reference to section below where it's more investigated].

Pathogenic Tandem Repeat Types Prevail

summary(tr_fav_sp$l_effective)
unique(tr_fav_sp$l_type)
summary(tr_unfav_sp$l_effective)
unique(tr_unfav_sp$l_type)
summary(tr_wnt_sp$l_effective)
unique(tr_wnt_sp$l_type)

In all three groups of proteins we could detect homo-, micro- and small TRs, of which homo- and micro-TRs are well known to be disease causing. We could further find domain-TRs in CRC favorable proteins and in proteins associated with the Wnt-pathway.
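The TR types named above are distinguished by the length of the repeat unit. The sketch below shows one way to assign such labels from `l_effective`. The thresholds used here (1 for homo, 2 to 3 for micro, 4 to 14 for small, 15 or more for domain) are assumptions made for illustration and should be checked against the definitions behind the `l_type` column.

```r
# Sketch: assign a TR type from the repeat unit length.
# Thresholds are assumptions, not taken from this document.
classify_tr <- function(l_effective) {
  cut(l_effective,
      breaks = c(0, 1, 3, 14, Inf),   # intervals (0,1], (1,3], (3,14], (14,Inf]
      labels = c("homo", "micro", "small", "domain"),
      right = TRUE)
}

toy_types <- classify_tr(c(1, 2, 3, 7, 15, 40))
```

Keeping the thresholds in one place makes it easy to re-label all TRs consistently if the definitions change.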

summary(tr_fav_sp$n_effective)
summary(tr_unfav_sp$n_effective)
summary(tr_wnt_sp$n_effective)
summary(tr_fav_sp$total_repeat_length)
summary(tr_unfav_sp$total_repeat_length)
summary(tr_wnt_sp$total_repeat_length)

Across all groups, TR-regions contain on average 7 repeat units. In CRC associated proteins the mean TR-region length is about 9 amino acids, ranging from 6 up to 70 and 53 amino acids for CRC favorable and unfavorable proteins, respectively. In Wnt-pathway related proteins we could likewise detect a mean repeat unit number of 7, with a mean TR-region length of 17 amino acids ranging from 6 to 864 amino acids.

tr_wnt_sp[which(tr_wnt_sp$total_repeat_length == max(tr_wnt_sp$total_repeat_length)),]

The clinical breast cancer marker Mucin-1 is the protein with the longest TR-region. Its expression is increased in most carcinoma cells, i.a. in colorectal cancer TODO TODO. Its single TR-region falls in the extracellular domain and provides many sites for O- and N-linked glycosylation. The amount of glycosylation is increased in breast cancer cells TODO, which is associated with decreased homophilic adhesion TODO, increased invasiveness TODO and protection from cytotoxic T-cell reactions. Through its TR-region MUC1 can bind to ICAM-1. It could be shown that the large TR size contributes to increased binding properties TODO and plays a role in cell migration and metastasis. The cytoplasmic tail stabilizes $\beta$-catenin through direct interaction TODO which can result in different cancer phenotypes TODO. [TODO reference to results below]

tr_wnt_sp$protein_name[which(tr_wnt_sp$total_repeat_length > 15)]
unique(tr_wnt_sp$prot_function[which(tr_wnt_sp$total_repeat_length > 15)])

Wnt-pathway associated proteins with long TR-regions are generally involved in transcriptional regulation. However, other proteins, such as the WD repeat-containing protein 26, which negatively regulates the Wnt-pathway, also contain long repeat regions.

tr_fav_sp$protein_name[which(tr_fav_sp$total_repeat_length > 15)]
unique(tr_fav_sp$prot_function[which(tr_fav_sp$total_repeat_length > 15)])

Similarly, proteins associated with favorable CRC prognosis that contain long TR-regions show, besides transcriptional regulatory functions, also roles in the regulation of alternative splicing.

tr_unfav_sp$protein_name[which(tr_unfav_sp$total_repeat_length > 15)]
unique(tr_unfav_sp$prot_function[which(tr_unfav_sp$total_repeat_length > 15)])

Interestingly, the Numb-like protein (ID: Q9Y6R0, Numbl) contains a long (20 AA) poly-Q homo-TR. Its interaction through NF-$\kappa$-B upregulates Lef1-expression which enhances transcriptional activity of TCF/LEF REF.

Binding of LEF/TCF to $\beta$-Catenin is Mediated Through Tandem Repeats

# Show only a selection of variables
sel_var <- c("ID", "begin", "msa_original", "repeat_region_length", "l_type", "prot_name", "protein_name", "gene_names", "prot_function")
tr_wnt_sp[which(tr_wnt_sp$ID == "Q9UJU2"),sel_var]

In LEF1 we found a significant poly-G TR-region at the N-terminal end which falls in the binding region with $\beta$-catenin1 (positions 1-66 based on similarity TODO).

tr_wnt_sp[which(tr_wnt_sp$ID == "Q9HCS4"),sel_var]
tr_wnt_sp[which(tr_wnt_sp$ID == "P36402"),sel_var]

A similar motif could be detected in TCF-3 and TCF-7, which also have a poly-G TR-region near the N-terminus that presumably offers a binding region for $\beta$-catenin1 TODO. Through this binding region, TCF-3 can mediate $\beta$-catenin degradation TODO.

$\beta$-Catenin Destruction Complex Not Only Relies On Tandem Repeats

The $\beta$-catenin destruction complex comprises different proteins, such as adenomatosis polyposis coli (APC), Axin, protein phosphatase (PP2A), glycogen synthase kinase 3 (GSK3) and casein kinase 1 (CK1).

APC $\beta$-Catenin Binding Domain Contains Tandem Repeats

tr_wnt_sp[grepl("APC", tr_wnt_sp$prot_name),sel_var]

The polyS homo-TR region in the APC protein sequence was the only repeat region which could be detected, even though there are structurally verified armadillo repeats near the N-terminal end of the protein sequence. However, in the very similar APC-like protein, we could detect a micro-TR in a region associated with the $\beta$-catenin binding that is necessary for its degradation TODO.

GSK

tr_wnt_sp[grepl("GSK", tr_wnt_sp$prot_name),sel_var]

In GSK3A and GSK3B we could detect a polyG homo-TR at N-terminal proximity and a micro TR near the C-terminus respectively.

Casein Kinase I

tr_wnt_sp[grepl("KC", tr_wnt_sp$prot_name),sel_var]
tr_fav_sp[grepl("KC", tr_fav_sp$prot_name),sel_var]
tr_unfav_sp[grepl("KC", tr_unfav_sp$prot_name),sel_var]

With the applied filtering approach, no TRs were found in Casein kinase I isoform epsilon among the Wnt-pathway associated or CRC favorable proteins. However, in the set of CRC unfavorable proteins, the search pattern matched the ATP-sensitive inward rectifier potassium channel 8, which is not a casein kinase; in this protein we found a small TR in the C-terminal cytoplasmic region.

Axin Offers A Prone Site For Tandem Repeat Disorder

tr_wnt_sp[grepl("AXIN", tr_wnt_sp$prot_name),sel_var]

Mutations in AXIN1 and AXIN2 are involved in CRC pathogenesis TODO. In Axin-2, a polyH homo-TR could be detected that overlaps with the compositionally biased histidine region and falls in the region interacting with the armadillo repeats of $\beta$-catenin.

A tandem repeat disorder (TRD) in this region may destabilize the binding of $\beta$-catenin to the protein complex with Axin-2, resulting in an accumulation of $\beta$-catenin followed by a transcriptional activation of the Wnt/$\beta$-catenin target genes, which is known to be a leading cause of CRC.

Dishevelled Proteins

tr_wnt_sp[grepl("DVL", tr_wnt_sp$prot_name),sel_var]

Dishevelled proteins bind to the C-terminus of Frizzled proteins, promoting their internalisation upon Wnt-signaling TODO. DVL-2 shows a polyP homo-TR near its C-terminal end which, to our knowledge, has not been linked to Frizzled binding.

FOXK1 and FOXK2 positively regulate the Wnt-pathway at the transcriptional level and interact with DVL proteins. FOXK1/2 are reported to be highly expressed in CRC, and their expression correlates with the nuclear localisation of DVL proteins TODO.

tr_wnt_sp[grepl("FOX", tr_wnt_sp$prot_name),sel_var]
tr_fav_sp[grepl("FOX", tr_fav_sp$prot_name),sel_var]
tr_unfav_sp[grepl("FOX", tr_unfav_sp$prot_name),sel_var]

In both FOXK1 and FOXK2 we could detect two TR-regions: a polyA homo-TR near the N-terminus and a micro-TR near the C-terminus. In FOXK1 the polyA homo-TR falls in the interaction domain with SIN3A. FOXK1 together with SIN3A cooperatively regulates cell cycle progression TODO.

Forkhead box protein C1 (FOXC1), in contrast, is part of the proteins associated with unfavorable CRC prognosis. FOXC1 acts as a transcriptional activator whose crucial domains are located at the C- and N-termini. Most of the detected TRs fall exactly in these regions responsible for transcriptional activation TODO. The C-terminal end 366-553 is the proposed binding region of a degradation factor TODO, and the region 47-553 could be shown to bind to PITX2 TODO, supporting the idea of TRs acting as key players in protein-protein interactions. To our knowledge, no structural interaction data is available to further test this hypothesis.

Interestingly, in both FOXK and FOXC proteins the forkhead box itself does not contain any TRs; instead, the TRs cluster around the forkhead box.

tr_wnt_sp[which(tr_wnt_sp$ID == "O00358"),sel_var]
tr_unfav_sp[which(tr_unfav_sp$ID == "P10070"),sel_var]

FOXE1 is one of the prognostic colorectal cancer genes and can indirectly activate the Wnt/$\beta$-catenin pathway TODO and shows a central poly-A TR-region.

Wnt Receptor Complex

The Wnt Receptor Complex is made up of Wnt, Frizzled (Fz), G-protein coupled receptors (GPCRs), lipoprotein receptor-related protein (LRP), receptor tyrosine kinase (RTK) and ROR.

Frizzled

tr_wnt_sp[grepl("FZ", tr_wnt_sp$prot_name),sel_var]

Frizzled membrane proteins are Wnt receptors that activate Dishevelled proteins. In FZD5 and FZD8 we could detect polyL and polyG homo-TRs near the C- and N-terminus, respectively. The polyL TR in FZD5 is near but not within the Frizzled domain and on the opposite terminus from the DVL binding site. In FZD8, the polyG TR-region is extracellular, forming a helical structure which is in topological proximity to the Wnt binding site.

WNT

tr_wnt_sp[grepl("WNT", tr_wnt_sp$prot_name),sel_var]

In WNT2B and WNT6, polyL homo-TRs are located near the N-terminus in the signal peptide, which is the part of the protein that interacts with signal recognition particles during translocation into the endoplasmic reticulum for secretion.

LRP

tr_wnt_sp[grepl("LRP", tr_wnt_sp$prot_name),sel_var]

LRP5 has a polyL homo-TR in its signal peptide at the N-terminus and a polyS homo-TR near the C-terminus. The latter is also found at a similar location in LRP6. To our knowledge, not much is known today about those regions at the C-terminus. In LRP1 the N-terminal polyL homo-TR is located in a helical transmembrane peptide in a possible low-density lipoprotein receptor region. In the canonical Wnt-pathway only LRP5 and LRP6 are reported to be involved.

Receptor Tyrosine Kinase \& Tyrosine-protein kinase transmembrane receptor

tr_wnt_sp[grepl("PTK", tr_wnt_sp$prot_name),sel_var]
tr_wnt_sp[grepl("ROR", tr_wnt_sp$prot_name),sel_var]

The inactive tyrosine-protein kinase 7 (PTK7) and the tyrosine-protein kinase transmembrane receptor ROR2 are mainly involved in the $\beta$-catenin independent planar cell polarity Wnt-pathway. PTK7 plays an important role in cancer cell invasion TODO. ROR2 has been shown to suppress the Wnt/$\beta$-catenin signal and is therefore also an interesting player in the fight against cancer TODO. We could not detect any TRs in either of these kinases, nor in CK1. It is thus possible to hypothesise that protein kinases tend to have few TRs in general.

Kinases Show No Tandem Repeat Depletion

As we could not detect any TRs in certain protein kinases of the Wnt/$\beta$-catenin pathway, we ask whether protein kinases generally lack TRs. We therefore filter the human proteome for protein kinases (enzyme commission numbers 2.7.10, 2.7.11, 2.7.12, 2.7.13, 2.7.14, 2.7.99).

# url to UniProt/Swiss-Prot querying enzyme commissions for protein kinases
url_kin <- "https://www.uniprot.org/uniprot/?query=ec:2.7.10.-%20OR%20ec:2.7.11.-%20OR%20ec:2.7.12.-%20OR%20ec:2.7.13.-%20OR%20ec:2.7.14.-%20OR%20ec:2.7.99.-&format=fasta&sort=score&fil=proteome:UP000005640%20AND%20reviewed:yes%20AND%20organism:%22Homo%20sapiens%20(Human)%20[9606]%22"
sp_kinIDs <- load_kinome(url = url_kin, path = dest_file_kin, OnlyIDs = TRUE)

# No. of protein kinases in the human proteome and their share of it
length(sp_kinIDs)
length(sp_kinIDs) / nrow(sp_all_fav)
# No. of protein kinases among Wnt-pathway proteins with TRs
sum(sp_kinIDs %in% tr_wnt$ID)
# Ratio of these kinases to the number of TR rows in the Wnt-pathway set
sum(sp_kinIDs %in% tr_wnt$ID) / nrow(tr_wnt)
# No. of protein kinases among CRC favorable proteins with TRs
sum(sp_kinIDs %in% tr_fav$ID)
# No. of protein kinases among CRC unfavorable proteins with TRs
sum(sp_kinIDs %in% tr_unfav$ID)

# Ratio of these kinases to the number of TR rows in both CRC sets
(sum(sp_kinIDs %in% tr_fav$ID) + sum(sp_kinIDs %in% tr_unfav$ID))/(nrow(tr_fav)+nrow(tr_unfav))

Of the 483 protein kinases in the human proteome, 11 have TRs and are associated with the Wnt-pathway, and 7 have TRs and appear in the set of CRC related proteins. To put this into perspective, 2% of the whole human proteome are protein kinases. Among the proteins with TRs, 4% belong to the enzyme class of protein kinases in CRC associated proteins and 6% in Wnt-pathway proteins.
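The ratios in the chunk above are taken relative to `nrow()`, which counts TR rows, so a protein with several TRs contributes more than once to the denominator. A per-protein share can be sketched as follows. The helper `share_in_set` and the toy IDs are hypothetical and only illustrate the idea.

```r
# Sketch: share of proteins in `set_ids` that belong to a reference list
# (e.g. protein kinases), counting each protein only once.
share_in_set <- function(ref_ids, set_ids) {
  set_ids <- unique(set_ids)
  sum(set_ids %in% ref_ids) / length(set_ids)
}

# Toy IDs (made up): K1 is a kinase with two TR rows.
toy_kinases <- c("K1", "K2", "K3")
toy_tr_ids  <- c("K1", "K1", "X1", "X2", "X3")
kinase_share <- share_in_set(toy_kinases, toy_tr_ids)
```

With the real data one might call, for example, `share_in_set(sp_kinIDs, tr_wnt$ID)`. Whether the per-protein or the per-TR ratio is more appropriate depends on the question asked.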

dest_file_kinext <- paste0(base_path, "data/swissprot_human_extkinome.tsv")
url_extkin <- "https://www.uniprot.org/uniprot/?query=ec:2.7.-.-%20AND%20reviewed:yes%20AND%20organism:%22Homo%20sapiens%20(Human)%20[9606]%22%20AND%20proteome:up000005640&format=fasta&sort=score"
sp_extkinIDS <- load_kinome(url = url_extkin, path = dest_file_kinext, OnlyIDs = TRUE)
# No. of EC 2.7.-.- enzymes in the human proteome and their share of it
length(sp_extkinIDS)
length(sp_extkinIDS) / nrow(sp_all_fav)
# No. of EC 2.7.-.- enzymes among Wnt-pathway proteins with TRs
sum(sp_extkinIDS %in% tr_wnt$ID)
sum(sp_extkinIDS %in% tr_wnt$ID) / nrow(tr_wnt)
# No. of EC 2.7.-.- enzymes among CRC favorable proteins with TRs
sum(sp_extkinIDS %in% tr_fav$ID)
# No. of EC 2.7.-.- enzymes among CRC unfavorable proteins with TRs
sum(sp_extkinIDS %in% tr_unfav$ID)

(sum(sp_extkinIDS %in% tr_fav$ID) + sum(sp_extkinIDS %in% tr_unfav$ID))/(nrow(tr_fav)+nrow(tr_unfav))

These findings do not support our hypothesis, and even when all enzymes transferring phosphorus-containing groups (EC 2.7.-.-) are considered, no depletion of TRs is apparent.
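Whether a protein class is depleted of TRs could also be assessed formally with a one-sided Fisher's exact test on a 2x2 table of proteins with and without TRs. The sketch below uses hypothetical counts that are not results of this study; it only illustrates the shape of such a test.

```r
# Hypothetical counts, for illustration only (not results of this study).
#                  with TR   without TR
# kinases             10         90
# other proteins     200       1800
toy_counts <- matrix(c(10, 200, 90, 1800), nrow = 2,
                     dimnames = list(c("kinase", "other"),
                                     c("with_TR", "without_TR")))

# One-sided test: are kinases less likely to carry a TR than other proteins?
depletion_test <- fisher.test(toy_counts, alternative = "less")
depletion_test$p.value    # a small p-value would indicate depletion
depletion_test$estimate   # conditional estimate of the odds ratio
```

Such a test requires the number of proteins without TRs in each class, which is why the 2x2 table is built from protein counts and not from TR rows.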

Conclusion

With the aim of providing a functional analysis of TRs in CRC associated proteins and Wnt-pathway components, we could find TR-regions that are potentially involved in CRC development. Proteins expressed by genes with unfavorable CRC prognosis according to the human protein atlas TODO show more TRs than those with favorable CRC prognosis. [TODO: WHY?]

In our previous study TODO we could show that homo-TRs cluster towards the flanks of protein sequences. We now found that especially extracellular or membrane-located proteins show a tendency for homo-TRs, which are mostly located in their signal peptide sequence. Examples are the WNT2B and WNT6 or LRP5/6 proteins, which show polyL homo-TRs in signal peptides.

Besides supporting previously known pathogenic TR-regions, we could detect possible new CRC driver TRs. It has already been shown by TODO that the domain-TR length in Mucin-1 correlates with its protein-protein interaction affinity and is responsible for cell migration and metastasis; our results here, and more generally our previous findings in TODO, support this. We further found a long polyQ homo-TR region in Numbl. TRs of this kind are prone sites for diseases caused by tandem repeat disorders (TRD). Numbl is also a negative regulator of the NF-$\kappa$-B pathway. Negative regulation of the NF-$\kappa$-B pathway is known as a source of CRC through interaction with the Wnt/$\beta$-catenin pathway. One possible interaction leading to CRC is NF-$\kappa$-B upregulated Lef1-expression, which enhances the transcriptional activity of TCF/LEF REF. In the region of LEF1, TCF-3 and TCF7 that binds $\beta$-catenin, we found polyG homo-TRs. LEF1 has been identified as being expressed only in CRC cells but not in healthy colon tissue. However, different splice variants can, among other effects, lose the $\beta$-catenin binding site; such variants act as transcriptional repressors and are often expressed in healthy colon epithelial cells TODO.

Taken together, these results suggest that TR regions may play important roles in cancer hallmark functions, and they provide a detailed view of the structural mechanisms involved.

This study lays the groundwork for future research into a systemic analysis of the functional mechanisms of protein TRs. Since we limited our detailed analysis to one of the many pathways involved in CRC, a natural progression of this work is to investigate other pathways analogously, providing a systemic overview of the interactions of CRC proteins with a focus on TRs.

NOTES & THOUGHTS

Conclusion:

Many of the detected TRs fall in regions which are not yet characterized but may play important roles in signaling and protein-protein interaction.

Only Few Wnt-pathway Proteins Are Associated With CRC

Nucleic Acid Binding Proteins Show Many Repeat Regions

Binding of LEF/TCF to β-Catenin is Mediated Through Tandem Repeats

β-Catenin Destruction Complex Not Only Relies On Tandem Repeats

Axin Offers A Prone Site For Tandem Repeat Disorder -> FOXC

Kinases Show No Tandem Repeat Depletion

Filter the top 20 CRC genes:

top20_unfav_genes <- c("LRCH4", "POFUT2", "CLK3", "EGFL7", "DPP7", "HSH2D", "ASB6", "SPAG4", "EXOC3L4", "HSPA1A", "PAQR6", "FAM69B", "CRACR2B", "ARHGAP4", "NPDC1", "DAPK1", "CNPY3", "ARL8A", "INAFM1", "RHBDD2")
(getTRbyGene(tr_sp = tr_unfav_sp, genes = top20_unfav_genes))

top20_fav_genes <- c("RBM3", "NOL11", "USP53", "TEX2", "HOOK1", "ZYG11B", "HSPA8", "DLAP", "SORT1", "DDX46", "FBXO7", "ABCD3", "NGLY1", "PARS2", "CLCC1", "AP3B1", "PRPSAP1", "PSMA5", "GRSF1", "CD274")
(getTRbyGene(tr_sp = tr_fav_sp, genes = top20_fav_genes))

Compare to results from Paulina

She found the following genes and their respective proteins to have TRs:

lina_top6_genes <- c("CDX2", "CLK3", "CNPY3", "CRACR2B", "DPP7", "HSH2D")
lina_top6_prots <- c("Q9NQX5", "H3BVF8", "A0A0C4DFY8", "Q8N4Y2", "R4GMV4", "Q96JZ2")

Not all of them appear in the list of genes from \url{https://www.proteinatlas.org/humanproteome/pathology/colorectal+cancer#colorectal%20cancerunfavourable} (22 May 2019):

lina_top6_genes %in% top20_unfav_genes
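The logical vector returned by `%in%` only tells us which positions match. To get the gene names directly, base R set operations can be used. This is a minimal sketch; the two gene vectors are copied from the chunks above so that it runs on its own.

```r
# Gene vectors copied from the chunks above so the example is self-contained
top20_unfav_genes <- c("LRCH4", "POFUT2", "CLK3", "EGFL7", "DPP7", "HSH2D",
                       "ASB6", "SPAG4", "EXOC3L4", "HSPA1A", "PAQR6", "FAM69B",
                       "CRACR2B", "ARHGAP4", "NPDC1", "DAPK1", "CNPY3", "ARL8A",
                       "INAFM1", "RHBDD2")
lina_top6_genes <- c("CDX2", "CLK3", "CNPY3", "CRACR2B", "DPP7", "HSH2D")

# Genes from the thesis that are also among the top 20 unfavorable genes
shared_genes <- intersect(lina_top6_genes, top20_unfav_genes)

# Genes from the thesis that are missing from the top 20 list
missing_genes <- setdiff(lina_top6_genes, top20_unfav_genes)

shared_genes
missing_genes
```

With the vectors as listed here, CDX2 is the only gene that is not part of the top 20 unfavorable genes.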

We also could not find TRs in all proteins mentioned in her thesis, only in this subset:

(getTRbyGene(tr_sp = tr_unfav_sp, genes = lina_top6_genes)[1])

Comparative analysis of favorable and unfavorable TRs

TODO!!!!!!!!!

Disorder Propensity Meta Data

Disorder Propensity per Protein

TODO: Add what we want here

Start by determining the amino acid distribution of the TR sequence:

# (aa_freq_TR <- AAfreq_in_TR(Q9UGU0))
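What such an amino acid distribution looks like can be sketched in base R. The helper `aa_freq_sketch()`, its column names and the example sequence are hypothetical and only illustrate the idea; this is not the package's `AAfreq_in_TR()` implementation.

```r
# Hypothetical helper: count the amino acids in a sequence string and
# return absolute counts plus relative frequencies.
aa_freq_sketch <- function(sequence) {
  residues <- strsplit(sequence, "", fixed = TRUE)[[1]]
  counts <- table(residues)
  data.frame(
    aa_freq = as.integer(counts),
    aa = names(counts),
    aa_ratio = as.numeric(counts) / length(residues),
    stringsAsFactors = FALSE
  )
}

# Made-up polyQ-like repeat sequence, for illustration only
toy_tr <- "QQQQQQPQQQ"
aa_freq_toy <- aa_freq_sketch(toy_tr)
aa_freq_toy
```

The result has one row per amino acid that occurs in the sequence, and the ratios sum to 1.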

We sort them according to their disorder propensity. More information can be found in Uversky et al. (http://www.tandfonline.com/doi/full/10.4161/idp.24684).

# # Not mentioned in Uversky's paper: "B", "O", "U", "Z", "X". These guys might need to fit in with the rest (if possible, as some of them represent multiple aa.)
# aa_order_promoting_to_disorder_promoting = c("C", "W", "I", "Y", "F", "L", "H", "V", "N", "M", "R", "T", "D", "G", "A", "K", "Q", "S", "E", "P", "B", "O", "U", "Z", "X")
# # Sort AA according to their disorder promoting potential
# aa_freq_TR <- aa_freq_TR[match(aa_order_promoting_to_disorder_promoting, aa_freq_TR$aa),]
# colnames(aa_freq_TR) <- c("aa_freq_tr", "aa", "aa_ratio_tr")
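The reordering step relies on `match()`: for each amino acid in the reference order it returns the row index in the frequency table, or `NA` if that amino acid does not occur. The following sketch uses a shortened reference order and a made-up table (the values are invented, not real results).

```r
# Shortened reference order, from order-promoting to disorder-promoting
aa_order <- c("C", "W", "Q", "P")

# Toy frequency table in arbitrary order; values are invented
toy_freq <- data.frame(
  aa = c("P", "Q", "C"),
  aa_ratio = c(0.1, 0.6, 0.3),
  stringsAsFactors = FALSE
)

# For each entry of aa_order: its row in toy_freq, NA if absent ("W" here)
idx <- match(aa_order, toy_freq$aa)

# Drop amino acids that do not occur, then reorder the rows
toy_sorted <- toy_freq[idx[!is.na(idx)], ]
toy_sorted$aa
```

Without the `NA` filter, amino acids that are missing from the table show up as rows filled with `NA`, which may or may not be wanted depending on the downstream plot.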

We then load the dataset with information about the amino acid frequency of all Swiss-Prot proteins and the disorder propensity of each amino acid. This data is included in the package and is generated by the script "/data-raw/create_AAfreqSP.R". Rerunning this script updates the amino acid frequency of the Swiss-Prot proteins.

# data("AAfreqSP")

We can now download the whole protein sequence and compare the amino acid frequency in the TRs to that of the overall protein:

# prot_seq <- download_prot_sequence("Q9UGU0")
# aa_freq_prot <- AAfreq_in_prot(prot_seq)
# (aa_freq_prot <- aa_freq_prot[match(aa_order_promoting_to_disorder_promoting, aa_freq_prot$aa),])
# colnames(aa_freq_prot) <- c("aa_freq_sp", "aa", "aa_ratio_sp")

Plotting the amino acid abundance in the TRs of this protein against the amino acid abundance in the whole protein, we see that amino acids with medium to high disorder propensity fall in the TR sequence.

# # we combine the datasets
# aa_freq_prot <- aa_freq_prot[1:20,]
# disorderpropensity <- AAfreqSP$disorderpropensity
# df <- cbind(aa_freq_TR[1:20,], disorderpropensity, aa_freq_prot)
# df <- df[ , !(names(df) %in% c("aa"))]
# 
# p <- ggplot(df, aes(x= aa_ratio_sp, y = aa_ratio_tr, size = disorderpropensity))+
#   geom_point()+
#   labs(x= "AA Background Frequency",
#        y = "AA Frequency in TRs")+
#   guides(size=guide_legend(title="Disorderpropensity"))+
#   theme_minimal()
# p <- beautifier(p, x.axis.text.angle = 0)
# p
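A numeric complement to the plot is the ratio between the frequency in the TRs and the background frequency of each amino acid. The sketch below uses invented frequencies and an assumed column layout; it is not part of the package and only shows how such a comparison could be computed.

```r
# Invented example frequencies; real values would come from the TR and
# whole-protein frequency tables.
toy_df <- data.frame(
  aa = c("L", "G", "Q"),
  aa_ratio_tr = c(0.05, 0.20, 0.40),
  aa_ratio_sp = c(0.10, 0.10, 0.05),
  stringsAsFactors = FALSE
)

# log2 ratio > 0: the amino acid is more frequent in the TRs than in the background
toy_df$log2_enrichment <- log2(toy_df$aa_ratio_tr / toy_df$aa_ratio_sp)

# Sort from most enriched to most depleted
toy_ranked <- toy_df[order(toy_df$log2_enrichment, decreasing = TRUE), ]
toy_ranked
```

A log2 scale keeps enrichment and depletion symmetric around zero, which makes the ranking easier to read than a raw ratio.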

Disorder Propensity per Set of Proteins



matteodelucchi/TRAL-Result-Analysis documentation built on Dec. 2, 2019, 11:42 p.m.