MotifDb: MotifDb: An Annotated Collection of DNA-binding sequence...

Description

Approximately 2000 position frequency matrices collected from public sources, with ample accompanying metadata, and search and export capabilities provided.

Details

MotifDb is an R object of class MotifList, whose entries are numeric matrices, accompanied by a 'parallel' metadata structure, a DataFrame, in which each row provides information about the corresponding matrix. This object is created automatically and fully populated with data from public sources when the package is loaded into your R environment via the library call. The matrices are obtained from the following public sources:

FlyFactorSurvey: 614
hPDI: 437
JASPAR_CORE: 459
jolma2013: 843
ScerTF: 196
stamlab: 683
UniPROBE: 380
cisbp 1.02: 874

Representing primarily the following organisms (49 in total):

Hsapiens: 2328
Dmelanogaster: 1008
Scerevisiae: 701
Mmusculus: 660
Athaliana: 160
Celegans: 44
other: 177

All the matrices are stored as position frequency matrices, in which each column (each position) sums to 1.0. When the number of sequences that contributed to the motif is known, that number will be found in the matrix's metadata. With this information, one can transform the matrices into either PCM (position count matrices) or PWM (position weight matrices), also known as PSSM (position-specific scoring matrices). The latter transformation requires that a model of the background distribution be known or assumed.
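The transformations described above can be sketched in base R. This is a minimal illustration, not the package's own method: the matrix holds the first four positions of the FlyFactorSurvey example entry documented on this page, the sequence count of 20 comes from that entry's metadata, and the uniform background and the pseudocount of 1 are assumptions chosen for the example.

```r
# First four positions of a position frequency matrix (PFM).
# matrix() fills column by column, so each group of four values is one position.
pfm <- matrix(c(0.00, 0.30, 0.40, 0.30,
                0.50, 0.15, 0.05, 0.30,
                0.20, 0.25, 0.50, 0.05,
                0.35, 0.00, 0.65, 0.00),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), 1:4))

# Number of contributing sequences, as recorded in the metadata.
sequence.count <- 20

# PCM: scale the frequencies back to counts.
pcm <- round(pfm * sequence.count)

# PWM: log2 odds against a background model.
# ASSUMPTION: uniform background and a pseudocount of 1 (avoids log2(0)).
background  <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
pseudocount <- 1
smoothed <- (pcm + pseudocount * background) / (sequence.count + pseudocount)
pwm <- log2(smoothed / background)
```

The background vector is recycled down each column, so every row (nucleotide) is compared with its own background probability. A different background model or pseudocount will change the PWM scores but not the PCM.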

The names of the matrices are the same as rownames of the metadata DataFrame, and have been chosen to balance the needs of concision and full description, including the organism in which the motif was discovered, the data source, and the name of the motif in the data source from which it was obtained. For example: "Hsapiens-JASPAR_CORE-SP1-MA0079.2" and "Scerevisiae-ScerTF-GSM1-badis".
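Because the names follow an organism-source-motif pattern, they can be split back into their parts with base R. The sketch below assumes that neither the organism nor the data source contains a hyphen (true of the two examples above, but not verified for every entry); the column name motifName is a hypothetical label for the remainder of the name.

```r
motif.names <- c("Hsapiens-JASPAR_CORE-SP1-MA0079.2",
                 "Scerevisiae-ScerTF-GSM1-badis")

# Capture: organism, data source, then everything that follows
# (the motif name may itself contain hyphens).
pattern <- "^([^-]+)-([^-]+)-(.+)$"
parts <- regmatches(motif.names, regexec(pattern, motif.names))

# Element 1 of each match is the full string; 2 to 4 are the captured groups.
parsed <- data.frame(organism   = sapply(parts, `[`, 2),
                     dataSource = sapply(parts, `[`, 3),
                     motifName  = sapply(parts, `[`, 4),
                     stringsAsFactors = FALSE)
parsed
```

In practice the metadata DataFrame already carries organism and dataSource as separate fields, so parsing the names is only needed when the names are all you have.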

Subsets of the matrices may be obtained in several ways:

The matrices are stored in a SimpleList which has semantics very similar to the familiar list of R base. To examine a matrix, however, you must sidestep the MotifDb show method. These three commands display quite different results:

> MotifDb [1]
MotifDb object of length 1
| Created from downloaded public sources: 2012-Jul6
| 1 position frequency matrices from 1 source:
|    FlyFactorSurvey:    1
| 1 organism/s
|      Dmelanogaster:    1
Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750 

> MotifDb [[1]]
    1    2    3    4 5 6 7 8 9   10   11   12   13   14   15   16   17   18   19   20   21
A 0.0 0.50 0.20 0.35 0 0 1 0 0 0.55 0.35 0.05 0.20 0.45 0.20 0.10 0.40 0.40 0.25 0.50 0.30
C 0.3 0.15 0.25 0.00 1 1 0 0 0 0.10 0.65 0.70 0.45 0.25 0.10 0.25 0.25 0.10 0.10 0.25 0.25
G 0.4 0.05 0.50 0.65 0 0 0 1 1 0.00 0.00 0.05 0.05 0.15 0.05 0.20 0.05 0.15 0.55 0.15 0.45
T 0.3 0.30 0.05 0.00 0 0 0 0 0 0.35 0.00 0.20 0.30 0.15 0.65 0.45 0.30 0.35 0.10 0.10 0.00

> as.list (MotifDb [1])
$`Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750`
    1    2    3    4 5 6 7 8 9   10   11   12   13   14   15   16   17   18   19   20   21
A 0.0 0.50 0.20 0.35 0 0 1 0 0 0.55 0.35 0.05 0.20 0.45 0.20 0.10 0.40 0.40 0.25 0.50 0.30
C 0.3 0.15 0.25 0.00 1 1 0 0 0 0.10 0.65 0.70 0.45 0.25 0.10 0.25 0.25 0.10 0.10 0.25 0.25
G 0.4 0.05 0.50 0.65 0 0 0 1 1 0.00 0.00 0.05 0.05 0.15 0.05 0.20 0.05 0.15 0.55 0.15 0.45
T 0.3 0.30 0.05 0.00 0 0 0 0 0 0.35 0.00 0.20 0.30 0.15 0.65 0.45 0.30 0.35 0.10 0.10 0.00

There are fifteen kinds of metadata, though not every matrix has a full complement, because not all of the public sources are complete in this regard. The information falls into these categories, using the Dmelanogaster-FlyFactorSurvey-ab_SANGER_10_FBgn0259750 entry as an example (see above for the associated position frequency matrix):

  1. providerName: "ab_SANGER_10_FBgn0259750"

  2. providerId: "FBgn0259750"

  3. dataSource: "FlyFactorSurvey"

  4. geneSymbol: "Ab"

  5. geneId: "FBgn0259750"

  6. geneIdType: "FLYBASE"

  7. proteinId: "E1JHF4"

  8. proteinIdType: "UNIPROT"

  9. organism: "Dmelanogaster"

  10. sequenceCount: 20

  11. bindingSequence: NA

  12. bindingDomain: NA

  13. tfFamily: NA

  14. experimentType: "bacterial 1-hybrid, SANGER sequencing"

  15. pubmedID: NA
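To make the point about incomplete metadata concrete, the example entry can be written out as a plain R list and checked for missing fields. This is only a stand-in for one row of the metadata DataFrame, built by hand from the values listed above; it does not use the package itself.

```r
# One metadata record, copied from the list above.
entry <- list(providerName    = "ab_SANGER_10_FBgn0259750",
              providerId      = "FBgn0259750",
              dataSource      = "FlyFactorSurvey",
              geneSymbol      = "Ab",
              geneId          = "FBgn0259750",
              geneIdType      = "FLYBASE",
              proteinId       = "E1JHF4",
              proteinIdType   = "UNIPROT",
              organism        = "Dmelanogaster",
              sequenceCount   = 20,
              bindingSequence = NA,
              bindingDomain   = NA,
              tfFamily        = NA,
              experimentType  = "bacterial 1-hybrid, SANGER sequencing",
              pubmedID        = NA)

# Which of the fifteen fields did the source not supply?
is.missing <- vapply(entry, is.na, logical(1))
missing.fields <- names(entry)[is.missing]
missing.fields
```

For this entry four of the fifteen fields are NA, so code that filters on fields such as tfFamily or pubmedID should be prepared for missing values.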

See Also

query, subset, export, flyFactorSurvey, hPDI, jaspar, ScerTF, uniprobe

Examples

library (MotifDb)
  # are there any matrices for Sox4?  we find two
mdb.sox4 <- MotifDb [grep ('sox4', values (MotifDb)$geneSymbol, ignore.case=TRUE)]
  # the same two matrices can be obtained this way also
if (interactive ())
  mdb.sox4 <- subset (MotifDb, tolower (geneSymbol)=='sox4')
  # and like this
mdb.sox4 <- query (MotifDb, 'sox4')  # matches against all fields in the metadata
  # implicitly invoke the 'show' method
mdb.sox4
  # get their full names
names (mdb.sox4)
  # examine their metadata
values (mdb.sox4)
  # examine the matrices, with their names included
as.list (mdb.sox4)
  # export the matrices in meme format
destination.file <- tempfile ()
export (mdb.sox4, destination.file, 'meme')

MotifDb documentation built on Nov. 8, 2020, 6:28 p.m.