Description Usage Format Details Source References
The coraAI data consists of a response, a journal indicator
matrix, and a co-citation network. This data is a subset of the Cora
text mining project (see the first reference).
The observations are text documents: 879 published
papers about either Artificial Intelligence (AI) or Machine
Learning (ML). The journal name for each document is available
(8 journals and an "other" category). The observed co-citation graph
is also available; each vertex is a document (observation), and
each edge weight is the number of citations that the two documents it
connects have in common.
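To make the edge weights concrete, here is a minimal sketch of one way shared-citation counts can be computed. The toy matrix `B` and its document and reference names are hypothetical and are not taken from coraAI, and reading "citations in common" as "works cited by both documents" is an assumption about the description above.

```r
# Hypothetical toy data: 4 documents (rows) and 5 cited works (columns).
# B[i, j] = 1 means document i cites work j.
B <- matrix(c(1, 1, 0, 0, 1,
              1, 0, 1, 0, 1,
              0, 0, 1, 1, 0,
              0, 1, 0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), paste0("ref", 1:5)))

# Entry (i, k) of B %*% t(B) counts the works cited by both document i and document k.
W <- tcrossprod(B)
diag(W) <- 0  # drop self-edges

W
```

The result is a symmetric weighted adjacency matrix with one row and one column per document, which is the general shape described for the co-citation network.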
The goal is to incorporate both the text information and the co-citation
information when predicting the subject of a paper (AI or ML).
Another interesting problem might be to predict the journal of a
paper given the text information and the categorization.
The coraAI data consists of three objects, each described below.
class: categorization of the document (observation) as either
AI or ML. Typically the response.
journals: indicator of the journal in which the document was
published (other, artificial-intelligence, machine-learning,
nueral-computing, ieee-trans-Nnet, ieee-tpami,
j-artificial-intelligence-research, ai-magazine, JASA).
cite: the adjacency matrix of the co-citation network for these
879 documents.
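As an illustration of what a binary journal indicator matrix looks like, the sketch below builds one from a factor of journal labels. The labels are hypothetical examples, and whether the journals object in coraAI is stored in exactly this form is an assumption.

```r
# Hypothetical journal labels for five documents (not taken from coraAI).
journal <- factor(c("other", "ai-magazine", "JASA", "other", "ieee-tpami"))

# One column per journal level; each row has a single 1 marking its journal.
J <- model.matrix(~ journal - 1)
colnames(J) <- levels(journal)

J
```

Each row sums to 1, so the matrix carries the same information as the factor but in a form that can serve as a design matrix.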
The spa approach is particularly appealing for this data since it fits a function directly to the graph and a coefficient vector to the journals. Other approaches require converting the journal information into a graph for processing, and it is unclear how to do that when the data is a binary design matrix.
The data was generated using AWK scripting from the raw Cora suite (first reference). The journal names were fixed to obtain a usable representation (e.g., tpami, ieee tpami, and pami are all ieee-tpami).
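The kind of journal-name cleanup described above can be sketched with a lookup table. This is an illustration in R, not the original AWK script; the function name `normalize_journal` and the lookup vector `canonical` are hypothetical, and only the three ieee-tpami variants come from the text.

```r
# Hypothetical variant-to-canonical lookup; only these three variants are documented.
canonical <- c("tpami"      = "ieee-tpami",
               "ieee tpami" = "ieee-tpami",
               "pami"       = "ieee-tpami")

normalize_journal <- function(x) {
  key <- tolower(trimws(x))       # ignore case and surrounding whitespace
  out <- unname(canonical[key])   # NA where no variant is known
  # Names without a known variant are returned in their cleaned form.
  ifelse(is.na(out), key, out)
}

normalize_journal(c("TPAMI", "ieee tpami ", "pami", "ai-magazine"))
```

A lookup table keeps the mapping explicit and easy to audit, which matters when many spellings must collapse to a small set of journal categories.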
A. McCallum, K. Nigam, J. Rennie, and K. Seymore (2000). Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3.
M. Culp (2011). spa: A Semi-Supervised R Package for Semi-Parametric Graph-Based Estimation. Journal of Statistical Software, 40(10), 1-29. URL http://www.jstatsoft.org/v40/i10/.