gclink: Gene-Cluster Discovery, Annotation and Visualization

View source: R/gclink.R

gclinkR Documentation

Description

gclink() performs all steps from raw blastp output to a publication-ready gene-cluster plot in one call. It dynamically adapts its workflow:

  • If in_seq_data is provided: Extracts coordinates/sequences, merges data, and generates plots.

  • If in_seq_data is NULL: Skips sequence-dependent steps, returning only the cluster table.

  • If in_GC_group is NULL: Omits functional-group merging and plotting.

  • Supports custom gene lists (e.g., photosynthesis, viral genes) via in_gene_list for universal cluster detection in bacteria, archaea, or phages.

Gene clusters are identified via a density threshold (AllGeneNum and MinConSeq) due to incomplete gene annotation coverage.

Usage

gclink(
  in_blastp_df = blastp_df,
  in_seq_data = seq_data,
  in_gene_list = photosynthesis_gene_list,
  in_GC_group = PGC_group,
  AllGeneNum = 50,
  MinConSeq = 25,
  apply_length_filter = FALSE,
  down_IQR = 10,
  up_IQR = 10,
  apply_evalue_filter = FALSE,
  min_evalue = 1,
  apply_score_filter = FALSE,
  min_score = 10,
  orf_before_first = 0,
  orf_after_last = 0,
  orf_range = "All",
  levels_gene_group = c("bch", "puh", "puf", "crt", "acsF", "assembly", "regulator",
    "hypothetical ORF"),
  color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683",
    "#956548", "grey"),
  genome_subset = NULL
)

Arguments

in_blastp_df

A data.frame from blast/blastp-style output with columns:

  • qaccver: Genome + contig name (separated by "---"), e.g., "Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_150":

    • Genome: "Kuafubacteriaceae--GCA_016703535.1" ("--" separator),

    • Contig: "JADJBV010000001.1",

    • ORF: "JADJBV010000001.1_150" ("_" separator),

    • Position: "150".

  • saccver: Gene name + metadata (separated by "_"), e.g., "bchC_Methyloversatilis_sp_RAC08_BSY238_2447_METR":

    • Gene: "bchC",

    • Metadata(Optional): "Methyloversatilis_sp_RAC08_BSY238_2447_METR".

eggNOG (evolutionary gene genealogy Nonsupervised Orthologous Groups) format results are supported by renaming annotation columns (e.g., "GOs") to saccver. Default: blastp_df.

in_seq_data

NULL or a data.frame with:

SeqName

ORF identifier (Prodigal format: ⁠>ORF_id # start # end # strand # ...⁠).

Sequence

ORF sequence.

Example: "Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_1 # 74 # 1018 # 1 # ..." Can be imported from Prodigal FASTA using:

seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>%
  data.frame(Sequence = .) %>%
  tibble::rownames_to_column("SeqName")

NULL skips coordinate extraction and plotting. Default: seq_data.

in_gene_list

Character vector of reference genes for cluster detection. Default: photosynthesis_gene_list.

in_GC_group

NULL or a data.frame mapping genes to functional groups (columns: gene, gene_group). NULL skips plotting. Default: PGC_group.

AllGeneNum

Integer; max ORFs per cluster. Default: 50.

MinConSeq

Integer; min consecutive reference genes per cluster. Default: 25.

apply_length_filter

Logical; whether to apply length filtering based on IQR. If FALSE, skips down_IQR and up_IQR filtering. Default: FALSE.

down_IQR

Numeric; lower-bound scale factor for IQR length filtering (see length_filter). Default: 10.

up_IQR

Numeric; upper-bound scale factor for IQR length filtering (see length_filter). Default: 10.

apply_evalue_filter

Logical; whether to filter BLAST hits by e-value cutoff. Requires column evalue in input data. Default: FALSE.

min_evalue

Numeric; maximum e-value threshold for retaining BLAST hits. Hits with e-value > cutoff will be removed. Typical values: 1e-3 (loose) to 1e-10 (strict). Default: 1.

apply_score_filter

Logical; whether to filter BLAST hits by bitscore cutoff. Requires column bitscore in input data. Default: FALSE.

min_score

Numeric; minimum bitscore threshold for retaining BLAST hits. Hits with bitscore < cutoff will be removed. For short proteins (~100 aa), 30-40 is typical; for long proteins (>300 aa), 50-100 is recommended. Default: 10.

orf_before_first

Integer; number of hypothetical ORFs to insert before the first gene of each detected cluster. Useful when the upstream annotation is incomplete or low-confidence; insertion is bounded by the first ORF of the contig. Default: 0.

orf_after_last

Integer; number of hypothetical ORFs to append after the last gene of each detected cluster. Helpful when the downstream annotation is incomplete or low-confidence; insertion is bounded by the last ORF of the contig. Default: 0.

orf_range

Character. ORF inclusion mode:

  • "All": Include all ORFs + annotations (default)

  • "OnlyAnnotated": Keep only annotated ORFs

  • "IgnoreAnnotated": Include all ORFs without annotation merging

levels_gene_group

Character vector; factor levels for gene-group legends (must include "hypothetical ORF" in case some genes remain unclassified). Ignored if in_GC_group is NULL. Default: c('bch','puh','puf','crt','acsF','assembly','regulator','hypothetical ORF').

color_theme

Character vector; colours for gc_plot, matched to the length and order of levels_gene_group. Ignored if plotting is skipped. Default: c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00','#FF0683','#956548','grey').

genome_subset

Character vector or NULL; genomes to retain. If NULL, all genomes are retained. Default: NULL.

Value

A named list with:

GC_meta

Annotated cluster table (data.frame).

GC_seq

FASTA sequences (if in_seq_data provided).

GC_plot

ggplot object (if in_seq_data and in_GC_group provided).

Examples

   # Case 1: Using blastp result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
   data(blastp_df)
   data(seq_data)
   data(photosynthesis_gene_list)
   data(PGC_group)
   gc_list <- gclink(in_blastp_df = blastp_df,
                     in_seq_data = seq_data,
                     in_gene_list = photosynthesis_gene_list,
                     in_GC_group  = PGC_group,
                     AllGeneNum = 50,
                     MinConSeq  = 25,
                     apply_length_filter = TRUE,
                     down_IQR   = 10,
                     up_IQR     = 10,
                     orf_before_first = 0,
                     orf_after_last = 0,
                     levels_gene_group = c('bch','puh','puf','crt','acsF','assembly','regulator',
                                           'hypothetical ORF'),
                     color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00',
                                     '#FF0683','#956548','grey'),
                     genome_subset = NULL)
   gc_meta = gc_list[["GC_meta"]]
   gc_seq = gc_list[["GC_seq"]]
   gc_plot = gc_list[["GC_plot"]]
   head(gc_meta)
   head(gc_seq)
   print(gc_plot)

   # Case 2: Using eggNOG result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
   data(eggnog_df)
   data(seq_data)
   data(KO_group)
   KOs = c("K02291","K09844","K20611","K13789",
           "K09846","K08926","K08927","K08928",
           "K08929","K13991","K04035","K04039",
           "K11337","K03404","K11336","K04040",
           "K03403","K03405","K04037","K03428",
           "K04038","K06049","K10960","K11333",
           "K11334","K11335","K08226","K08226",
           "K09773")
   rename_KOs = paste0("ko:", KOs)
   eggnog_df$qaccver = eggnog_df$`#query`
   eggnog_df$saccver = eggnog_df$KEGG_ko
   eggnog_df$evalue = eggnog_df$evalue
   eggnog_df$bitscore = eggnog_df$score
   eggnog_df$gene = eggnog_df$KEGG_ko
   gc_list_2 = gclink(in_blastp_df = eggnog_df,
                      in_seq_data = seq_data,
                      in_gene_list = rename_KOs,
                      in_GC_group  = KO_group,
                      AllGeneNum = 50,
                      MinConSeq  = 25,
                      apply_evalue_filter = FALSE,
                      min_evalue = 1,
                      apply_score_filter = TRUE,
                      min_score = 10,
                      orf_before_first = 1,
                      orf_after_last = 1,
                      levels_gene_group = c('bch','puh','puf','crt',
                                            'acsF','assembly','hypothetical ORF'),
                      color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320',
                                      '#F8EB00','#FF0683','grey'))
   gc_meta_2 = gc_list_2[["GC_meta"]]
   gc_seq_2 = gc_list_2[["GC_seq"]]
   gc_plot_2 = gc_list_2[["GC_plot"]]
   head(gc_meta_2)
   head(gc_seq_2)
   print(gc_plot_2)

   # SOLUTION FOR "Viewport has zero dimension(s)" ERROR in print(gc_plot)
   # --------------------------------------------------
   # When you see this error, try these 3 solutions:

   # 1. RESIZE RSTUDIO PLOT PANEL (Quickest fix)
   # Simply click and drag the right border of the plot panel wider

   # 2. LAUNCH IN NEW WINDOW
   # dev.new()
   # print(gc_plot)

   # 3. SAVE DIRECTLY TO FILE
   # ggplot2::ggsave('gc_plot.pdf', gc_plot, width=20,height=3)

gclink documentation built on Sept. 9, 2025, 5:39 p.m.