gclink: Gene-Cluster Discovery, Annotation and Visualization
In gclink: Gene-Cluster Discovery, Annotation and Visualization

gclink

R Documentation

Gene-Cluster Discovery, Annotation and Visualization

Description

gclink() performs all steps from raw blastp output to a publication-ready gene-cluster plot in one call. It dynamically adapts its workflow:

If in_seq_data is provided: Extracts coordinates/sequences, merges data, and generates plots.
If in_seq_data is NULL: Skips sequence-dependent steps, returning only the cluster table.
If in_GC_group is NULL: Omits functional-group merging and plotting.
Supports custom gene lists (e.g., photosynthesis, viral genes) via in_gene_list for universal cluster detection in bacteria, archaea, or phages.

Gene clusters are identified via a density threshold (AllGeneNum and MinConSeq) due to incomplete gene annotation coverage.

Usage

gclink(
  in_blastp_df = blastp_df,
  in_seq_data = seq_data,
  in_gene_list = photosynthesis_gene_list,
  in_GC_group = PGC_group,
  AllGeneNum = 50,
  MinConSeq = 25,
  apply_length_filter = FALSE,
  down_IQR = 10,
  up_IQR = 10,
  apply_evalue_filter = FALSE,
  min_evalue = 1,
  apply_score_filter = FALSE,
  min_score = 10,
  orf_before_first = 0,
  orf_after_last = 0,
  orf_range = "All",
  levels_gene_group = c("bch", "puh", "puf", "crt", "acsF", "assembly", "regulator",
    "hypothetical ORF"),
  color_theme = c("#3BAA51", "#6495ED", "#DD2421", "#EF9320", "#F8EB00", "#FF0683",
    "#956548", "grey"),
  genome_subset = NULL
)

Arguments

`in_blastp_df`	A `data.frame` from blast/blastp-style output with columns: `qaccver`: Genome + contig name (separated by `"---"`), e.g., `"Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_150"`: Genome: `"Kuafubacteriaceae--GCA_016703535.1"` (`"--"` separator), Contig: `"JADJBV010000001.1"`, ORF: `"JADJBV010000001.1_150"` (`"_"` separator), Position: `"150"`. `saccver`: Gene name + metadata (separated by `"_"`), e.g., `"bchC_Methyloversatilis_sp_RAC08_BSY238_2447_METR"`: Gene: `"bchC"`, Metadata(Optional): `"Methyloversatilis_sp_RAC08_BSY238_2447_METR"`. eggNOG (evolutionary gene genealogy Nonsupervised Orthologous Groups) format results are supported by renaming annotation columns (e.g., `"GOs"`) to `saccver`. Default: `blastp_df`.
`in_seq_data`	`NULL` or a `data.frame` with: `SeqName` ORF identifier (Prodigal format: `⁠>ORF_id # start # end # strand # ...⁠`). `Sequence` ORF sequence. Example: `"Kuafubacteriaceae--GCA_016703535.1---JADJBV010000001.1_1 # 74 # 1018 # 1 # ..."` Can be imported from Prodigal FASTA using: seq_data <- Biostrings::readBStringSet("Prodigal.fasta",format="fasta", nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE) %>% data.frame(Sequence = .) %>% tibble::rownames_to_column("SeqName") `NULL` skips coordinate extraction and plotting. Default: `seq_data`.
`in_gene_list`	Character vector of reference genes for cluster detection. Default: `photosynthesis_gene_list`.
`in_GC_group`	`NULL` or a `data.frame` mapping genes to functional groups (columns: `gene`, `gene_group`). `NULL` skips plotting. Default: `PGC_group`.
`AllGeneNum`	Integer; max ORFs per cluster. Default: `50`.
`MinConSeq`	Integer; min consecutive reference genes per cluster. Default: `25`.
`apply_length_filter`	Logical; whether to apply length filtering based on IQR. If `FALSE`, skips `down_IQR` and `up_IQR` filtering. Default: `FALSE`.
`down_IQR`	Numeric; lower-bound scale factor for IQR length filtering (see `length_filter`). Default: `10`.
`up_IQR`	Numeric; upper-bound scale factor for IQR length filtering (see `length_filter`). Default: `10`.
`apply_evalue_filter`	Logical; whether to filter BLAST hits by e-value cutoff. Requires column `evalue` in input data. Default: `FALSE`.
`min_evalue`	Numeric; maximum e-value threshold for retaining BLAST hits. Hits with e-value > cutoff will be removed. Typical values: 1e-3 (loose) to 1e-10 (strict). Default: `1`.
`apply_score_filter`	Logical; whether to filter BLAST hits by bitscore cutoff. Requires column `bitscore` in input data. Default: `FALSE`.
`min_score`	Numeric; minimum bitscore threshold for retaining BLAST hits. Hits with bitscore < cutoff will be removed. For short proteins (~100 aa), 30-40 is typical; for long proteins (>300 aa), 50-100 is recommended. Default: `10`.
`orf_before_first`	Integer; number of hypothetical ORFs to insert before the first gene of each detected cluster. Useful when the upstream annotation is incomplete or low-confidence; insertion is bounded by the first ORF of the contig. Default: `0`.
`orf_after_last`	Integer; number of hypothetical ORFs to append after the last gene of each detected cluster. Helpful when the downstream annotation is incomplete or low-confidence; insertion is bounded by the last ORF of the contig. Default: `0`.
`orf_range`	Character. ORF inclusion mode: `"All"`: Include all ORFs + annotations (default) `"OnlyAnnotated"`: Keep only annotated ORFs `"IgnoreAnnotated"`: Include all ORFs without annotation merging
`levels_gene_group`	Character vector; factor levels for gene-group legends (must include "hypothetical ORF" in case some genes remain unclassified). Ignored if `in_GC_group` is `NULL`. Default: `c('bch','puh','puf','crt','acsF','assembly','regulator','hypothetical ORF')`.
`color_theme`	Character vector; colours for `gc_plot`, matched to the length and order of `levels_gene_group`. Ignored if plotting is skipped. Default: `c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00','#FF0683','#956548','grey')`.
`genome_subset`	Character vector or `NULL`; genomes to retain. If `NULL`, all genomes are retained. Default: `NULL`.

Value

A named list with:

GC_meta: Annotated cluster table (data.frame).
GC_seq: FASTA sequences (if in_seq_data provided).
GC_plot: ggplot object (if in_seq_data and in_GC_group provided).

Examples

   # Case 1: Using blastp result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
   data(blastp_df)
   data(seq_data)
   data(photosynthesis_gene_list)
   data(PGC_group)
   gc_list <- gclink(in_blastp_df = blastp_df,
                     in_seq_data = seq_data,
                     in_gene_list = photosynthesis_gene_list,
                     in_GC_group  = PGC_group,
                     AllGeneNum = 50,
                     MinConSeq  = 25,
                     apply_length_filter = TRUE,
                     down_IQR   = 10,
                     up_IQR     = 10,
                     orf_before_first = 0,
                     orf_after_last = 0,
                     levels_gene_group = c('bch','puh','puf','crt','acsF','assembly','regulator',
                                           'hypothetical ORF'),
                     color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320','#F8EB00',
                                     '#FF0683','#956548','grey'),
                     genome_subset = NULL)
   gc_meta = gc_list[["GC_meta"]]
   gc_seq = gc_list[["GC_seq"]]
   gc_plot = gc_list[["GC_plot"]]
   head(gc_meta)
   head(gc_seq)
   print(gc_plot)

   # Case 2: Using eggNOG result with Full pipeline (Find Cluster + Extract FASTA + Plot Cluster)
   data(eggnog_df)
   data(seq_data)
   data(KO_group)
   KOs = c("K02291","K09844","K20611","K13789",
           "K09846","K08926","K08927","K08928",
           "K08929","K13991","K04035","K04039",
           "K11337","K03404","K11336","K04040",
           "K03403","K03405","K04037","K03428",
           "K04038","K06049","K10960","K11333",
           "K11334","K11335","K08226","K08226",
           "K09773")
   rename_KOs = paste0("ko:", KOs)
   eggnog_df$qaccver = eggnog_df$`#query`
   eggnog_df$saccver = eggnog_df$KEGG_ko
   eggnog_df$evalue = eggnog_df$evalue
   eggnog_df$bitscore = eggnog_df$score
   eggnog_df$gene = eggnog_df$KEGG_ko
   gc_list_2 = gclink(in_blastp_df = eggnog_df,
                      in_seq_data = seq_data,
                      in_gene_list = rename_KOs,
                      in_GC_group  = KO_group,
                      AllGeneNum = 50,
                      MinConSeq  = 25,
                      apply_evalue_filter = FALSE,
                      min_evalue = 1,
                      apply_score_filter = TRUE,
                      min_score = 10,
                      orf_before_first = 1,
                      orf_after_last = 1,
                      levels_gene_group = c('bch','puh','puf','crt',
                                            'acsF','assembly','hypothetical ORF'),
                      color_theme = c('#3BAA51','#6495ED','#DD2421','#EF9320',
                                      '#F8EB00','#FF0683','grey'))
   gc_meta_2 = gc_list_2[["GC_meta"]]
   gc_seq_2 = gc_list_2[["GC_seq"]]
   gc_plot_2 = gc_list_2[["GC_plot"]]
   head(gc_meta_2)
   head(gc_seq_2)
   print(gc_plot_2)

   # SOLUTION FOR "Viewport has zero dimension(s)" ERROR in print(gc_plot)
   # --------------------------------------------------
   # When you see this error, try these 3 solutions:

   # 1. RESIZE RSTUDIO PLOT PANEL (Quickest fix)
   # Simply click and drag the right border of the plot panel wider

   # 2. LAUNCH IN NEW WINDOW
   # dev.new()
   # print(gc_plot)

   # 3. SAVE DIRECTLY TO FILE
   # ggplot2::ggsave('gc_plot.pdf', gc_plot, width=20,height=3)

gclink documentation built on Sept. 9, 2025, 5:39 p.m.