View source: R/motif_analysis.R
Load the SNP data.
filename | A table containing the SNP information. Must contain at least five columns with exactly the following names:
If this file exists already, it is used to extract the SNP information. Otherwise, the SNP information extracted using the argument 'snpids' is written to this file. |
genome.lib | A string giving the library name for the genome version. Default: "BSgenome.Hsapiens.UCSC.hg38". |
snp.lib | A string giving the library name used to obtain the SNP information based on rs ids. Default: "SNPlocs.Hsapiens.dbSNP144.GRCh38". |
snpids | A vector of rs ids for the SNPs. This argument is overridden if the file named by 'filename' exists. |
half.window.size | An integer for the half window size around the SNP within which the motifs are matched. Default: 30. |
default.par | A boolean indicating whether to use the default Markov parameters. Default: FALSE. |
mutation | A boolean indicating whether this is mutation data. See details for more information. Default: FALSE. |
... | Other parameters passed to |
This function extracts the nucleotide sequence within a window
around each SNP and codes it using 1-A, 2-C, 3-G, 4-T.
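The coding scheme can be illustrated with a minimal base R sketch. The helper `encode_sequence` below is hypothetical and is not part of the package; it only shows the letter-to-integer mapping stated above, and it assumes that any character outside A, C, G, T should become `NA`.

```r
# Hypothetical helper (not the package's code): map nucleotide letters to
# the integer codes described above, 1-A, 2-C, 3-G, 4-T.
nucleotide_codes <- c(A = 1L, C = 2L, G = 3L, T = 4L)

encode_sequence <- function(seq_string) {
  # Split the string into single characters, upper-casing first.
  letters_vec <- strsplit(toupper(seq_string), "")[[1]]
  # Look up each letter; letters that are not A/C/G/T become NA.
  unname(nucleotide_codes[letters_vec])
}

encoded <- encode_sequence("ACGTTGCA")
print(encoded)
```

A sequence that yields `NA` here would correspond to one containing non-ACGT characters.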
There are two ways of obtaining the nucleotide sequences. If filename
is not NULL and the file exists, it should contain the positions and alleles
for each SNP. Based on this information, the sequences around the SNP positions
are extracted using the Bioconductor annotation package specified by
genome.lib. Users should make sure that this annotation package
corresponds to the correct species and genome version of the actual data.
Alternatively, users can provide a vector of rs ids via the argument
snpids. The SNP locations and allele information are then obtained via
the Bioconductor annotation package specified by snp.lib, and passed
on to the package specified by genome.lib to obtain the
nucleotide sequences.
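As a rough illustration of what extracting a window around a SNP position means, the base R sketch below uses a toy string in place of a real genome. The function `extract_window` and the toy sequence are hypothetical, and the sketch assumes a symmetric window of `2 * half.window.size + 1` nucleotides centred on the SNP; the package itself queries the annotation package named by genome.lib rather than a character string.

```r
# Toy stand-in for a chromosome sequence (an assumption for illustration).
toy_chromosome <- "ACGTACGTACGTACGTACGT"

# Hypothetical helper: return the nucleotides within half_window_size
# positions on either side of snp_pos, or NA if the window runs off the end.
extract_window <- function(chrom_seq, snp_pos, half_window_size) {
  start <- snp_pos - half_window_size
  end <- snp_pos + half_window_size
  if (start < 1 || end > nchar(chrom_seq)) {
    return(NA_character_)
  }
  substr(chrom_seq, start, end)
}

window <- extract_window(toy_chromosome, snp_pos = 10, half_window_size = 3)
print(window)
print(nchar(window))
```

Under this assumption, the SNP sits at the middle position of the returned window.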
If mutation=FALSE (default), this function assumes that the data is
for SNP analysis, and the reference genome should be consistent with either
the a1 or a2 nucleotide. When extracting the genome sequence around each SNP
position, this function compares the nucleotide at the SNP location on the
reference genome with both a1 and a2 to distinguish between the reference
allele and the SNP allele. If the nucleotide extracted from the reference
genome does not match either a1 or a2, the SNP is discarded. The discarded
SNPs are in the 'rsid.rm' field in the output.
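The filtering rule for mutation=FALSE can be sketched in base R on a toy table. Everything in this sketch is an assumption made for illustration: the SNP identifiers are made up, the column `ref` stands for the nucleotide read from the reference genome, and the variable names do not come from the package.

```r
# Toy SNP table with made-up identifiers; 'ref' is the nucleotide found on
# the reference genome at each SNP position.
snps <- data.frame(
  snpid = c("rsA", "rsB", "rsC"),
  a1 = c("A", "G", "C"),
  a2 = c("G", "T", "T"),
  ref = c("A", "T", "G"),
  stringsAsFactors = FALSE
)

# A SNP is kept only if the reference nucleotide equals a1 or a2.
keep <- snps$ref == snps$a1 | snps$ref == snps$a2

# Identifiers of discarded SNPs, playing the role of the 'rsid.rm' field.
rsid_rm <- snps$snpid[!keep]
kept <- snps[keep, ]

# For kept SNPs, the allele equal to the reference nucleotide is the
# reference allele and the other one is the SNP allele.
kept$ref_allele <- kept$ref
kept$snp_allele <- ifelse(kept$ref == kept$a1, kept$a2, kept$a1)

print(rsid_rm)
print(kept)
```

In this toy input, the third row is set aside because its reference nucleotide matches neither allele.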
Alternatively, if mutation=TRUE, this function assumes that the data
is for general single nucleotide mutation analysis. After extracting the
genome sequence around each SNP position, it replaces the nucleotide at the
SNP location by the a1 nucleotide as the 'reference' allele sequence, and by
the a2 nucleotide as the 'snp' allele sequence. It does NOT discard the
sequence even if neither a1 nor a2 matches the reference genome. When this
data set is used in other functions, such as ComputeMotifScore and
ComputePValues, all the results (i.e. affinity scores and
their p-values) reported for the reference allele are in fact for the a1 allele, and
results reported for the SNP allele are in fact for the a2 allele.
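The substitution described for mutation=TRUE can be sketched as follows. The helper `make_allele_sequences` is hypothetical, and the sketch assumes an odd-length window with the SNP at its middle position; it is meant only to show that both sequences are built by overwriting the centre nucleotide, regardless of what the reference genome holds there.

```r
# Hypothetical helper: build the 'reference' and 'snp' sequences by placing
# a1 and a2 at the centre of the extracted window.
make_allele_sequences <- function(window, a1, a2) {
  centre <- (nchar(window) + 1) %/% 2
  ref_seq <- window
  snp_seq <- window
  substr(ref_seq, centre, centre) <- a1  # 'reference' sequence carries a1
  substr(snp_seq, centre, centre) <- a2  # 'snp' sequence carries a2
  list(reference = ref_seq, snp = snp_seq)
}

# The centre of this toy window is "C", which matches neither a1 nor a2;
# no sequence is discarded in this mode.
seqs <- make_allele_sequences("GTACGTA", a1 = "T", a2 = "G")
print(seqs)
```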
If the input is a list of rsids, the SNP information extracted from
snp.lib may contain more than two alleles for a single location. For
such cases, LoadSNPData first extracts all pairs of alleles
associated with those locations. If 'mutation=TRUE', all those pairs are
considered as pairs of reference and SNP alleles, and their information is
contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'. If 'mutation=FALSE',
LoadSNPData further filters these pairs based on whether one
allele matches the reference genome nucleotide extracted from
genome.lib. Only those pairs with one allele matching the reference
genome nucleotide are considered as pairs of reference and SNP alleles, with
their information contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'.
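The handling of loci with more than two alleles can be pictured with a small base R sketch. The allele set and reference nucleotide below are made up, and the sketch assumes that "all pairs" means unordered pairs as produced by `combn`; the package may enumerate or order the pairs differently.

```r
# Made-up example: three alleles reported at one locus, reference is "C".
alleles <- c("A", "C", "T")
ref_nt <- "C"

# All unordered pairs of alleles, one pair per row.
pairs <- t(combn(alleles, 2))
colnames(pairs) <- c("first", "second")

# With mutation = TRUE, every pair would be retained.
pairs_mutation <- pairs

# With mutation = FALSE, only pairs containing the reference nucleotide
# would be retained.
has_ref <- pairs[, "first"] == ref_nt | pairs[, "second"] == ref_nt
pairs_snp <- pairs[has_ref, , drop = FALSE]

print(pairs_mutation)
print(pairs_snp)
```

Here the pair of "A" and "T" survives only in the mutation=TRUE case, since neither allele equals the reference nucleotide.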
A list object containing the following components:
sequence_matrix | A list of integer vectors representing the nucleotide sequence around each SNP. |
a1 | An integer vector for the nucleotide at the SNP location on the reference genome. |
a2 | An integer vector for the nucleotide at the SNP location on the SNP genome. |
snpid | A string vector for the SNP rsids. |
rsid.missing | If the data source is a list of rsids, this field records rsids for SNPs that are discarded because they are not in the SNPlocs package. |
rsid.duplicate | If the data source is a list of rsids, this field records rsids for SNPs whose locus has more than 2 alleles according to the SNPlocs package. |
rsid.na | This field records rsids for SNPs that are discarded because the nucleotide sequences contain non-ACGT characters. |
rsid.rm | If the data source is a table and mutation=FALSE, this
field records rsids for SNPs that are discarded because the nucleotide on the
reference genome matches neither 'a1' nor 'a2' in the data source. |
The results are coded as: "A"-1, "C"-2, "G"-3, "T"-4.
Chandler Zuo chandler.c.zuo@gmail.com