View source: R/getChromInfoFromUCSC.R
getChromInfoFromUCSC | R Documentation |
getChromInfoFromUCSC returns chromosome information such as sequence names, lengths, and circularity flags for a given UCSC genome, e.g. hg19, panTro6, or sacCer3.
Note that getChromInfoFromUCSC behaves slightly differently depending on whether a genome is registered in the GenomeInfoDb package or not. See below for the details.
Use registered_UCSC_genomes to list all the UCSC genomes currently registered in the GenomeInfoDb package.
getChromInfoFromUCSC(genome,
                     assembled.molecules.only=FALSE,
                     map.NCBI=FALSE,
                     add.ensembl.col=FALSE,
                     goldenPath.url=getOption("UCSC.goldenPath.url"),
                     recache=FALSE,
                     as.Seqinfo=FALSE)

registered_UCSC_genomes(organism=NA)
genome |
A single string specifying the name of a UCSC genome, e.g. "hg19", "panTro6", or "sacCer3". |
assembled.molecules.only |
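If FALSE (the default), chromosome information is returned for all the sequences in the genome. If TRUE, chromosome information is returned only for the assembled molecules. Note that assembled.molecules.only=TRUE is supported only for registered genomes. On an unregistered genome the argument is ignored with a warning. |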
map.NCBI |
TRUE or FALSE (the default). If TRUE, the "NCBI columns" described in the Value section are added to the result. Note that not all rows in the "UCSC chrom info" data frame are necessarily mapped to a row in the "NCBI chrom info" data frame. For example chrM in hg19 has no corresponding sequence in the GRCh37 assembly (the mitochondrial chromosome was omitted from GRCh37). For the unmapped rows the NCBI columns in the final data frame are filled with NAs. The primary use case for using map.NCBI=TRUE is to map UCSC sequence names to NCBI sequence names. |
add.ensembl.col |
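TRUE or FALSE (the default). Whether or not to add the Ensembl sequence names to the result, in column ensembl. |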
goldenPath.url |
A single string specifying the URL to the UCSC goldenPath location where the chromosome sizes are expected to be found. |
recache |
TRUE or FALSE (the default). If TRUE, retrieval of the chromosome info from UCSC is forced. |
as.Seqinfo |
TRUE or FALSE (the default). If TRUE, a Seqinfo object is returned instead of a data frame. |
organism |
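When organism is specified, registered_UCSC_genomes() returns only the genomes registered for that organism. organism must be a single string. As the examples show, a partial match such as "homo" for Homo sapiens is accepted. |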
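The NA-filling behavior described for map.NCBI can be pictured as a left join of the UCSC rows onto the NCBI rows. The following is a minimal base-R sketch on mock data, not the package's implementation. The two small data frames and the ucsc_name column are hypothetical stand-ins introduced only for illustration.

```r
## Mock "UCSC chrom info": three hg19-style sequence names.
ucsc <- data.frame(chrom=c("chr1", "chr2", "chrM"),
                   stringsAsFactors=FALSE)

## Mock "NCBI chrom info". The 'ucsc_name' column is a hypothetical
## lookup key; chrM is deliberately absent, as in GRCh37.
ncbi <- data.frame(SequenceName=c("1", "2"),
                   ucsc_name=c("chr1", "chr2"),
                   stringsAsFactors=FALSE)

## Left join: every UCSC row is kept, unmatched rows get NA.
m <- match(ucsc$chrom, ncbi$ucsc_name)
ucsc$NCBI.SequenceName <- ncbi$SequenceName[m]

## Rows with no NCBI counterpart:
unmapped <- ucsc[is.na(ucsc$NCBI.SequenceName), ]
unmapped
```

With real data the same is.na() filter can be applied to the result of getChromInfoFromUCSC("hg19", map.NCBI=TRUE), as done in the examples.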
*** Registered vs unregistered UCSC genomes ***
For registered genomes, the returned data frame contains information about which sequences are assembled molecules and which are not, and the assembled.molecules.only argument is supported. For unregistered genomes, this information is missing, and the assembled.molecules.only argument is ignored with a warning.
For registered genomes, the returned circularity flags are guaranteed to be accurate. For unregistered genomes, a heuristic is used to determine the circular sequences.
For registered genomes, special care is taken to make sure that the sequences are returned in a sensible order. For unregistered genomes, a heuristic is used to return the sequences in a sensible order.
Please contact the maintainer of the GenomeInfoDb package to request registration of additional genomes.
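Because assembled.molecules.only is ignored for unregistered genomes, it can be useful to check registration up front. The sketch below assumes that the data frame returned by registered_UCSC_genomes() has a column named genome. The helper is_registered_genome() is hypothetical (not part of GenomeInfoDb), and a small mock data frame is used so that the code runs without the package installed.

```r
## Hypothetical helper, not part of GenomeInfoDb: TRUE if 'genome'
## appears in the 'genome' column of a registered-genomes data frame.
is_registered_genome <- function(genome, registered) {
    stopifnot(is.character(genome), length(genome) == 1L)
    genome %in% registered$genome
}

## Mock stand-in for the registry. In a real session one would use:
##   registered <- GenomeInfoDb::registered_UCSC_genomes()
## (assumption: that data frame has a 'genome' column).
registered <- data.frame(
    organism=c("Homo sapiens", "Homo sapiens", "Pan troglodytes"),
    genome=c("hg19", "hg38", "panTro6"),
    stringsAsFactors=FALSE
)

is_registered_genome("hg38", registered)       # a registered genome
is_registered_genome("myGenome1", registered)  # a made-up name
```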
*** Offline mode ***
getChromInfoFromUCSC() supports an "offline mode" when called with assembled.molecules.only=TRUE, but only for a selection of registered genomes. The "offline mode" works thanks to a collection of tab-delimited files stored in the package that contain the "assembled molecules info" for the supported genomes. This makes calls like:
    getChromInfoFromUCSC("hg38", assembled.molecules.only=TRUE)
fast and reliable, i.e. the call will always work, even when offline!
See README.TXT in GenomeInfoDb/inst/extdata/assembled_molecules_db/UCSC/ for more information.
Note that calling getChromInfoFromUCSC() with assembled.molecules.only=FALSE (the default), or with recache=TRUE, will trigger retrieval of the chromosome info from UCSC, and will issue a warning if this info no longer matches the "assembled molecules info" stored in the package.
Please contact the maintainer of the GenomeInfoDb package to request genome additions to the "offline mode".
For getChromInfoFromUCSC: By default, a 4-column data frame with columns:
chrom: character.
size: integer.
assembled: logical.
circular: logical.
If map.NCBI is TRUE, then 7 "NCBI columns" are added to the result:
NCBI.SequenceName: character.
NCBI.SequenceRole: factor.
NCBI.AssignedMolecule: factor.
NCBI.GenBankAccn: character.
NCBI.Relationship: factor.
NCBI.RefSeqAccn: character.
NCBI.AssemblyUnit: factor.
Note that the names of the "NCBI columns" are those returned by getChromInfoFromNCBI but with the NCBI. prefix added to them.
If add.ensembl.col is TRUE, the column ensembl is added to the result.
For registered_UCSC_genomes: A data frame summarizing all the UCSC genomes currently registered in the GenomeInfoDb package.
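To make the shape of the default result concrete, here is a minimal sketch that builds a mock data frame with the four default columns and derives a few common views from it. The rows are illustrative placeholders (the chrUn_example scaffold and its size are made up), not output from getChromInfoFromUCSC().

```r
## Mock data frame with the 4 default columns. Values are placeholders
## for illustration; 'chrUn_example' is a made-up scaffold name.
chrom_info <- data.frame(
    chrom=c("chr1", "chrM", "chrUn_example"),
    size=c(248956422L, 16569L, 1000L),
    assembled=c(TRUE, TRUE, FALSE),
    circular=c(FALSE, TRUE, FALSE),
    stringsAsFactors=FALSE
)

## Keep only the assembled molecules (chromosomes, no scaffolds).
assembled_info <- chrom_info[chrom_info$assembled, ]

## Named vector of sequence lengths, keyed by sequence name.
seqlens <- setNames(assembled_info$size, assembled_info$chrom)

## Names of the circular sequences.
circ <- chrom_info$chrom[chrom_info$circular]
```

The same subsetting applies to a real result for a registered genome, where the assembled column is available.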
H. Pagès
getChromInfoFromNCBI for getting chromosome information for an NCBI assembly.
getChromInfoFromEnsembl for getting chromosome information for an Ensembl species.
Seqinfo objects.
The getBSgenome convenience utility in the BSgenome package for getting a BSgenome object from an installed BSgenome data package.
## ---------------------------------------------------------------------
## A. BASIC EXAMPLES
## ---------------------------------------------------------------------
## --- Internet access required! ---
getChromInfoFromUCSC("hg19")
getChromInfoFromUCSC("hg19", as.Seqinfo=TRUE)
## Map the hg38 sequences to their corresponding sequences in
## the GRCh38.p13 assembly:
getChromInfoFromUCSC("hg38", map.NCBI=TRUE)[c(1, 5)]
## Note that some NCBI-based UCSC genomes contain sequences that
## are not mapped. For example this is the case for chrM in hg19:
hg19 <- getChromInfoFromUCSC("hg19", map.NCBI=TRUE)
hg19[is.na(hg19$NCBI.SequenceName), ]
## Map the hg19 sequences to the Ensembl sequence names:
getChromInfoFromUCSC("hg19", add.ensembl.col=TRUE)
## --- No internet access required! (offline mode) ---
getChromInfoFromUCSC("hg19", assembled.molecules.only=TRUE)
getChromInfoFromUCSC("panTro6", assembled.molecules.only=TRUE)
getChromInfoFromUCSC("bosTau9", assembled.molecules.only=TRUE)
## --- List of UCSC genomes currently registered in the package ---
registered_UCSC_genomes()
## All registered UCSC genomes for Felis catus (domestic cat):
registered_UCSC_genomes(organism = "Felis catus")
## All registered UCSC genomes for Homo sapiens:
registered_UCSC_genomes("homo")
## ---------------------------------------------------------------------
## B. USING getChromInfoFromUCSC() TO SET UCSC SEQUENCE NAMES ON THE
## GRCh38 GENOME
## ---------------------------------------------------------------------
## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
library(BSgenome)
genome <- getBSgenome("GRCh38")  # this loads the
                                 # BSgenome.Hsapiens.NCBI.GRCh38 package
genome
## Get the chromosome info for the hg38 genome:
hg38_chrom_info <- getChromInfoFromUCSC("hg38", map.NCBI=TRUE)
ncbi2ucsc <- setNames(hg38_chrom_info$chrom,
                      hg38_chrom_info$NCBI.SequenceName)
## Set the UCSC sequence names on 'genome':
seqlevels(genome) <- ncbi2ucsc[seqlevels(genome)]
genome
## Sanity check: check that the sequence lengths in 'genome' are the same
## as in 'hg38_chrom_info':
m <- match(seqlevels(genome), hg38_chrom_info$chrom)
stopifnot(identical(unname(seqlengths(genome)), hg38_chrom_info$size[m]))