getChromInfoFromNCBI: Get chromosome information for an NCBI assembly

View source: R/getChromInfoFromNCBI.R

getChromInfoFromNCBIR Documentation

Get chromosome information for an NCBI assembly

Description

getChromInfoFromNCBI returns chromosome information like sequence names, lengths and circularity flags for a given NCBI assembly e.g. for GRCh38, ARS-UCD1.2, R64, etc...

Note that getChromInfoFromNCBI behaves slightly differently depending on whether the assembly is registered in the GenomeInfoDb package or not. See below for the details.

Use registered_NCBI_assemblies to list all the NCBI assemblies currently registered in the GenomeInfoDb package.

Usage

getChromInfoFromNCBI(assembly,
                     assembled.molecules.only=FALSE,
                     assembly.units=NULL,
                     recache=FALSE,
                     as.Seqinfo=FALSE)

registered_NCBI_assemblies(organism=NA)

Arguments

assembly

A single string specifying the name of an NCBI assembly (e.g. "GRCh38"). Alternatively, an assembly accession (GenBank or RefSeq) can be supplied (e.g. "GCF_000001405.12").

assembled.molecules.only

If FALSE (the default) then chromosome information is returned for all the sequences in the assembly (unless assembly.units is specified, see below), that is, for all the chromosomes, plasmids, and scaffolds.

If TRUE then chromosome information is returned only for the assembled molecules. These are the chromosomes (including the mitochondrial chromosome) and plasmids only. No scaffolds.

assembly.units

If NULL (the default) then chromosome information is returned for all the sequences in the assembly (unless assembled.molecules.only is set to TRUE, see above), that is, for all the chromosomes, plasmids, and scaffolds.

assembly.units can be set to a character vector containing the names of Assembly Units (e.g. "non-nuclear") in which case chromosome information is returned only for the sequences that belong to these Assembly Units.

recache

getChromInfoFromNCBI uses a cache mechanism so the chromosome information of a given assembly only gets downloaded once during the current R session (note that the caching is done in memory so cached information does NOT persist across sessions). Setting recache to TRUE forces a new download (and recaching) of the chromosome information for the specified assembly.

as.Seqinfo

TRUE or FALSE (the default). If TRUE then a Seqinfo object is returned instead of a data frame. Note that only the SequenceName, SequenceLength, and circular columns of the data frame are used to make the Seqinfo object. All the other columns are ignored (and lost).

organism

When organism is specified, registered_NCBI_assemblies() will only return the subset of assemblies that are registered for that organism. organism must be specified as a single string and will be used to perform a search (with grep()) on the "organism" column of the data frame returned by registered_NCBI_assemblies(). The search is case-insensitive.

Details

registered vs unregistered NCBI assemblies:

  • All NCBI assemblies can be looked up by assembly accession (GenBank or RefSeq) but only registered assemblies can also be looked up by assembly name.

  • For registered assemblies, the returned circularity flags are guaranteed to be accurate. For unregistered assemblies, a heuristic is used to determine the circular sequences.

Please contact the maintainer of the GenomeInfoDb package to request registration of additional assemblies.

Value

For getChromInfoFromNCBI: By default, a 10-column data frame with columns:

  1. SequenceName: character.

  2. SequenceRole: factor.

  3. AssignedMolecule: factor.

  4. GenBankAccn: character.

  5. Relationship: factor.

  6. RefSeqAccn: character.

  7. AssemblyUnit: factor.

  8. SequenceLength: integer. Note that this column **can** contain NAs! For example this is the case in assembly Amel_HAv3.1 where the length of sequence MT is missing or in assembly Release 5 where the length of sequence Un is missing.

  9. UCSCStyleName: character.

  10. circular: logical.

For registered_NCBI_assemblies: A data frame summarizing all the NCBI assemblies currently registered in the GenomeInfoDb package.

Author(s)

H. Pagès

See Also

  • getChromInfoFromUCSC for getting chromosome information for a UCSC genome.

  • getChromInfoFromEnsembl for getting chromosome information for an Ensembl species.

  • Seqinfo objects.

Examples

## All registered NCBI assemblies for Triticum aestivum (bread wheat):
registered_NCBI_assemblies("tri")[1:4]

## All registered NCBI assemblies for Homo sapiens:
registered_NCBI_assemblies("homo")[1:4]

## Internet access required!
getChromInfoFromNCBI("GRCh37")
getChromInfoFromNCBI("GRCh37", as.Seqinfo=TRUE)
getChromInfoFromNCBI("GRCh37", assembled.molecules.only=TRUE)

## The GRCh38.p14 assembly only adds "patch sequences" to the GRCh38
## assembly:
GRCh38 <- getChromInfoFromNCBI("GRCh38")
table(GRCh38$SequenceRole)
GRCh38.p14 <- getChromInfoFromNCBI("GRCh38.p14")
table(GRCh38.p14$SequenceRole)  # 254 patch sequences (164 fix + 90 novel)

## All registered NCBI assemblies for Arabidopsis thaliana:
registered_NCBI_assemblies("arabi")[1:4]
getChromInfoFromNCBI("TAIR10.1")
getChromInfoFromNCBI("TAIR10.1", assembly.units="non-nuclear")

## Sanity checks:
idx <- match(GRCh38$SequenceName, GRCh38.p14$SequenceName)
stopifnot(!anyNA(idx))
tmp1 <- GRCh38.p14[idx, ]
rownames(tmp1) <- NULL
tmp2 <- GRCh38.p14[-idx, ]
stopifnot(
  identical(tmp1[ , -(5:7)], GRCh38[ , -(5:7)]),
  identical(tmp2, GRCh38.p14[GRCh38.p14$AssemblyUnit == "PATCHES", ])
)

Bioconductor/GenomeInfoDb documentation built on Dec. 2, 2024, 1:41 a.m.