elec2 | R Documentation |
Electricity Pricing Data Set Exhibiting Concept Drift
data(elec2)
A data frame with 27552 observations on the following 5 variables.
x1
a numeric vector
x2
a numeric vector
x3
a numeric vector
x4
a numeric vector
y
class label
This data set has become a benchmark of sorts in streaming classification. It was first described by Harries (1999) and has since been used for several performance comparisons [e.g., Baena-Garcia et al. (2006); Kuncheva and Plumpton (2008)]. It holds information for the Australian New South Wales (NSW) Electricity Market: 27552 records dated from May 1996 to December 1998, each referring to a period of 30 minutes. These are the completely observed records, subsampled from 45312 total records, some of which have missing values. The records have seven fields: a binary class label, two time stamp indicators (day of week, time), and four covariates capturing aspects of electricity demand and supply.
An appealing property of this dataset is that it is expected to contain drifting data distributions since, during the recording period, the electricity market was expanded to include adjacent areas. This allowed for the production surplus of one region to be sold in the adjacent region, which in turn dampened price levels.
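One rough way to look for the kind of drift described above is to compare the share of one class label across consecutive blocks of the time-ordered records. The sketch below is only an illustration: block_class_share is a hypothetical helper (not part of dynaTree), the demonstration uses synthetic labels rather than elec2, and a change in class shares is only one possible symptom of drift.

```r
## Hypothetical helper (not part of dynaTree): share of one class label
## within consecutive blocks of a time-ordered label vector.
block_class_share <- function(y, block = 1000, level = sort(unique(y))[1]) {
  grp <- ceiling(seq_along(y) / block)   # block index for each record
  tapply(y == level, grp, mean)          # share of `level` in each block
}

## Synthetic labels (NOT elec2) whose class mix shifts halfway through
set.seed(1)
y.sim <- c(rbinom(2000, 1, 0.3), rbinom(2000, 1, 0.7)) + 1
shares <- block_class_share(y.sim, block = 1000)
print(round(shares, 2))

## If dynaTree is installed, the same summary could be applied to the
## real labels (assumes the label column is named y, as listed above)
if (requireNamespace("dynaTree", quietly = TRUE)) {
  data(elec2, package = "dynaTree")
  print(round(block_class_share(elec2$y, block = 2000), 2))
}
```

A flat profile of block shares would not rule out drift, since the relationship between covariates and label can change while the class mix stays constant.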
M. Harries. “Splice-2 Comparative Evaluation: Electricity Pricing”. University of New South Wales, School of Computer Science and Engineering technical report (1999)
Anagnostopoulos, C., Gramacy, R.B. (2013) “Information-Theoretic Data Discarding for Dynamic Trees on Data Streams.” Entropy, 15(12), 5510-5535; arXiv:1201.5568
M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, A. Bifet, R. Gavalda and R. Morales-Bueno. “Early drift detection method”. ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams, pp. 77-86 (2006)
L.I. Kuncheva and C.O. Plumpton. “Adaptive Learning Rate for Online Linear Discriminant Classifiers”. SSPR and SPR 2008, Lecture Notes in Computer Science (LNCS), 5342, pp. 510-519 (2008)
## this is a snippet from the "elec2" demo; see that demo
## for a full comparison to dynaTree models which can
## cope with drifting concepts
## set up data
data(elec2)
X <- elec2[,1:4]
y <- drop(elec2[,5])
## predictive likelihood for repeated trials
T <- 200 ## use nrow(X) for a longer version,
## short T is for faster CRAN checks
hits <- rep(NA, T)
## fit the initial model
n <- 25; N <- 1000
fit <- dynaTree(X[1:n,], y[1:n], N=N, model="class")
w <- 1
for(t in (n+1):T) {
## predict the next data point
## full model
fit <- predict(fit, XX=X[t,], yy=y[t])
hits[t] <- which.max(fit$p) == y[t]
## sanity check retiring index
if(any(fit$X[w,] != X[t-n,])) stop("bad retiring")
## retire
fit <- retire(fit, w)
## update retiring index
w <- w + 1; if(w >= n) w <- 1
## update with new point
fit <- update(fit, X[t,], y[t], verb=100)
}
## free C-side memory
deleteclouds()
## exponentially weighted moving average of hits over time
## (weight 0.05 on the newest point, 0.95 on the running value)
rhits <- rep(0, length(hits))
for(i in (n+1):length(hits)) {
rhits[i] <- 0.05*as.numeric(hits[i]) + 0.95*rhits[i-1]
}
## plot moving window of hit rates
plot(rhits, type="l", main="moving window of hit rates",
ylab="hit rates", xlab="t")
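The smoothing loop in the example is an exponentially weighted running mean rather than a fixed-width window. A minimal sketch of the same recursion as a reusable function follows; ewma_hits and its arguments are hypothetical names introduced here, and the toy input is made up purely to show the arithmetic.

```r
## Hypothetical helper: exponentially weighted running mean of a 0/1 (or
## logical) vector, mirroring
##   rhits[i] <- 0.05*as.numeric(hits[i]) + 0.95*rhits[i-1]
## Entries before `start` are left at 0, as in the example loop.
ewma_hits <- function(hits, alpha = 0.05, start = 2) {
  stopifnot(start >= 2)                  # the recursion needs r[i - 1]
  r <- rep(0, length(hits))
  if (start <= length(hits)) {
    for (i in start:length(hits)) {
      r[i] <- alpha * as.numeric(hits[i]) + (1 - alpha) * r[i - 1]
    }
  }
  r
}

## Made-up hit indicators; the leading NAs stand in for the unscored
## initial points and are skipped because start = 3
toy <- c(NA, NA, TRUE, TRUE, FALSE, TRUE)
sm <- ewma_hits(toy, alpha = 0.5, start = 3)
print(sm)
```

With the objects from the example, ewma_hits(hits, alpha = 0.05, start = n + 1) should reproduce rhits. Because the recursion starts from 0, the early values understate the hit rate until enough points have accumulated.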