getChromInfoFromUCSC: Get chromosome information for a UCSC genome

View source: R/getChromInfoFromUCSC.R

getChromInfoFromUCSCR Documentation

Get chromosome information for a UCSC genome

Description

getChromInfoFromUCSC returns chromosome information like sequence names, lengths and circularity flags for a given UCSC genome e.g. for hg19, panTro6, sacCer3, etc...

Note that getChromInfoFromUCSC behaves slightly differently depending on whether a genome is registered in the GenomeInfoDb package or not. See below for the details.

Use registered_UCSC_genomes to list all the UCSC genomes currently registered in the GenomeInfoDb package.

Usage

getChromInfoFromUCSC(genome,
                     assembled.molecules.only=FALSE,
                     map.NCBI=FALSE,
                     add.ensembl.col=FALSE,
                     goldenPath.url=getOption("UCSC.goldenPath.url"),
                     recache=FALSE,
                     as.Seqinfo=FALSE)

registered_UCSC_genomes(organism=NA)

Arguments

genome

A single string specifying the name of a UCSC genome e.g. "panTro6", "mm39", "sacCer3", etc...

assembled.molecules.only

If FALSE (the default) then chromosome information is returned for all the sequences in the genome, that is, for all the chromosomes, plasmids, and scaffolds.

If TRUE then chromosome information is returned only for the assembled molecules. These are the chromosomes (including the mitochondrial chromosome) and plasmids only. No scaffolds.

Note that assembled.molecules.only=TRUE is supported only for registered genomes. When used on an unregistered genome, assembled.molecules.only is ignored with a warning.

map.NCBI

TRUE or FALSE (the default).

If TRUE then NCBI chromosome information is bound to the result. This information is retrieved from NCBI by calling getChromInfoFromNCBI on the NCBI assembly that the UCSC genome is based on. Then the data frame returned by getChromInfoFromNCBI ("NCBI chrom info") is mapped and bound to the data frame returned by getChromInfoFromUCSC ("UCSC chrom info"). This "map and bind" operation is similar to a JOIN in SQL.

Note that not all rows in the "UCSC chrom info" data frame are necessarily mapped to a row in the "NCBI chrom info" data frame. For example chrM in hg19 has no corresponding sequence in the GRCh37 assembly (the mitochondrial chromosome was omitted from GRCh37). For the unmapped rows the NCBI columns in the final data frame are filled with NAs (LEFT JOIN in SQL).

The primary use case for using map.NCBI=TRUE is to map UCSC sequence names to NCBI sequence names. This is only supported for registered UCSC genomes based on an NCBI assembly!

add.ensembl.col

TRUE or FALSE (the default). Whether or not the Ensembl sequence names should be added to the result (in column ensembl).

goldenPath.url

A single string specifying the URL to the UCSC goldenPath location where the chromosome sizes are expected to be found.

recache

getChromInfoFromUCSC uses a cache mechanism so the chromosome sizes of a given genome only get downloaded once during the current R session (note that the caching is done in memory so cached information does NOT persist across sessions). Setting recache to TRUE forces a new download (and recaching) of the chromosome sizes for the specified genome.

as.Seqinfo

TRUE or FALSE (the default). If TRUE then a Seqinfo object is returned instead of a data frame. Note that only the chrom, size, and circular columns of the data frame are used to make the Seqinfo object. All the other columns are ignored (and lost).

organism

When organism is specified, registered_UCSC_genomes() will only return the subset of genomes that are registered for that organism. organism must be specified as a single string and will be used to perform a search (with grep()) on the "organism" column of the data frame returned by registered_UCSC_genomes(). The search is case-insensitive.

Details

*** Registered vs unregistered UCSC genomes ***

  • For registered genomes, the returned data frame contains information about which sequences are assembled molecules and which are not, and the assembled.molecules.only argument is supported. For unregistered genomes, this information is missing, and the assembled.molecules.only argument is ignored with a warning.

  • For registered genomes, the returned circularity flags are guaranteed to be accurate. For unregistered genomes, a heuristic is used to determine the circular sequences.

  • For registered genomes, special care is taken to make sure that the sequences are returned in a sensible order. For unregistered genomes, a heuristic is used to return the sequences in a sensible order.

Please contact the maintainer of the GenomeInfoDb package to request registration of additional genomes.

*** Offline mode ***

getChromInfoFromUCSC() supports an "offline mode" when called with assembled.molecules.only=TRUE, but only for a selection of registered genomes. The "offline mode" works thanks to a collection of tab-delimited files stored in the package, that contain the "assembled molecules info" for the supported genomes. This makes calls like:

    getChromInfoFromUCSC("hg38", assembled.molecules.only=TRUE)

fast and reliable i.e. the call will always work, even when offline!

See README.TXT in GenomeInfoDb/inst/extdata/assembled_molecules_db/UCSC/ for more information.

Note that calling getChromInfoFromUCSC() with assembled.molecules.only=FALSE (the default), or with recache=TRUE, will trigger retrieval of the chromosome info from UCSC, and will issue a warning if this info no longer matches the "assembled molecules info" stored in the package.

Please contact the maintainer of the GenomeInfoDb package to request genome additions to the "offline mode".

Value

For getChromInfoFromUCSC: By default, a 4-column data frame with columns:

  1. chrom: character.

  2. size: integer.

  3. assembled: logical.

  4. circular: logical.

If map.NCBI is TRUE, then 7 "NCBI columns" are added to the result:

  • NCBI.SequenceName: character.

  • NCBI.SequenceRole: factor.

  • NCBI.AssignedMolecule: factor.

  • NCBI.GenBankAccn: character.

  • NCBI.Relationship: factor.

  • NCBI.RefSeqAccn: character.

  • NCBI.AssemblyUnit: factor.

Note that the names of the "NCBI columns" are those returned by getChromInfoFromNCBI but with the NCBI. prefix added to them.

If add.ensembl.col is TRUE, the column ensembl is added to the result.

For registered_UCSC_genomes: A data frame summarizing all the UCSC genomes currently registered in the GenomeInfoDb package.

Author(s)

H. Pagès

See Also

  • getChromInfoFromNCBI for getting chromosome information for an NCBI assembly.

  • getChromInfoFromEnsembl for getting chromosome information for an Ensembl species.

  • Seqinfo objects.

  • The getBSgenome convenience utility in the BSgenome package for getting a BSgenome object from an installed BSgenome data package.

Examples

## ---------------------------------------------------------------------
## A. BASIC EXAMPLES
## ---------------------------------------------------------------------

## --- Internet access required! ---

getChromInfoFromUCSC("hg19")

getChromInfoFromUCSC("hg19", as.Seqinfo=TRUE)

## Map the hg38 sequences to their corresponding sequences in
## the GRCh38.p13 assembly:
getChromInfoFromUCSC("hg38", map.NCBI=TRUE)[c(1, 5)]

## Note that some NCBI-based UCSC genomes contain sequences that
## are not mapped. For example this is the case for chrM in hg19:
hg19 <- getChromInfoFromUCSC("hg19", map.NCBI=TRUE)
hg19[is.na(hg19$NCBI.SequenceName), ]

## Map the hg19 sequences to the Ensembl sequence names:
getChromInfoFromUCSC("hg19", add.ensembl.col=TRUE)

## --- No internet access required! (offline mode) ---

getChromInfoFromUCSC("hg19", assembled.molecules.only=TRUE)

getChromInfoFromUCSC("panTro6", assembled.molecules.only=TRUE)

getChromInfoFromUCSC("bosTau9", assembled.molecules.only=TRUE)

## --- List of UCSC genomes currently registered in the package ---

registered_UCSC_genomes()

## All registered UCSC genomes for Felis catus (domestic cat):
registered_UCSC_genomes(organism = "Felis catus")

## All registered UCSC genomes for Homo sapiens:
registered_UCSC_genomes("homo")

## ---------------------------------------------------------------------
## B. USING getChromInfoFromUCSC() TO SET UCSC SEQUENCE NAMES ON THE
##    GRCh38 GENOME
## ---------------------------------------------------------------------

## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
library(BSgenome)
genome <- getBSgenome("GRCh38")  # this loads the
                                 # BSgenome.Hsapiens.NCBI.GRCh38 package
genome

## Get the chromosome info for the hg38 genome:
hg38_chrom_info <- getChromInfoFromUCSC("hg38", map.NCBI=TRUE)
ncbi2ucsc <- setNames(hg38_chrom_info$chrom,
                      hg38_chrom_info$NCBI.SequenceName)

## Set the UCSC sequence names on 'genome':
seqlevels(genome) <- ncbi2ucsc[seqlevels(genome)]
genome

## Sanity check: check that the sequence lengths in 'genome' are the same
## as in 'hg38_chrom_info':
m <- match(seqlevels(genome), hg38_chrom_info$chrom)
stopifnot(identical(unname(seqlengths(genome)), hg38_chrom_info$size[m]))

Bioconductor/GenomeInfoDb documentation built on Dec. 2, 2024, 1:41 a.m.