DSD_ReadStream: Read a Data Stream from a File or a Connection
In stream: Infrastructure for Data Stream Mining

Description

A DSD class that reads a data stream (text format) from a file or any R connection.

Usage

DSD_ReadStream(
  file,
  k = NA,
  take = NULL,
  sep = ",",
  header = FALSE,
  skip = 0,
  col.names = NULL,
  colClasses = NA,
  outofpoints = c("warn", "ignore", "stop"),
  ...
)

DSD_ReadCSV(
  file,
  k = NA,
  take = NULL,
  sep = ",",
  header = FALSE,
  skip = 0,
  col.names = NULL,
  colClasses = NA,
  outofpoints = c("warn", "ignore", "stop"),
  ...
)

## S3 method for class 'DSD_ReadStream'
close_stream(dsd, ...)

## S3 method for class 'DSD_ReadCSV'
close_stream(dsd, ...)

Arguments

`file`	A file/URL or an open connection.
`k`	Number of true clusters, if known.
`take`	Indices of columns to extract from the file.
`sep`	The character string that separates dimensions in data points in the stream.
`header`	Does the first line contain variable names?
`skip`	The number of lines of the data file to skip before beginning to read data.
`col.names`	A vector of optional names for the variables. The default is to use `"V"` followed by the column number. Additional information columns (e.g., class labels) need names starting with `.`.
`colClasses`	A vector of classes to be assumed for the columns passed on to `read.table()`.
`outofpoints`	Action taken if fewer than `n` data points are available. Supported actions are: `warn` (the default): return the available points (possibly an empty data.frame) with a warning. `ignore`: silently return the available points. `stop`: stop with an error.
`...`	Further arguments are passed on to `read.table()`. This can for example be used for encoding, quotes, etc.
`dsd`	An object of class `DSD_ReadCSV`.

Details

DSD_ReadStream uses readLines() and read.table() to read data from an R connection line-by-line and convert it into a data.frame. The connection is responsible for keeping track of the current read position in the stream. Typically, the connection is a file stored on disk, but many other types of connections are possible (see connection).
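The following base R sketch illustrates the general idea of reading a text stream in chunks from an open connection. It is not the package's actual implementation; the helper `read_chunk()` and the sample data are hypothetical and exist only for this illustration.

```r
# Write a tiny comma-separated file to read back as a "stream".
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2", "3,4", "5,6", "7,8"), tmp)

# An open connection remembers how far it has been read.
con <- file(tmp, open = "r")

# Hypothetical helper: read up to n lines and parse them into a data.frame.
read_chunk <- function(con, n, sep = ",") {
  lines <- readLines(con, n = n)
  if (length(lines) == 0L) {
    return(data.frame())  # nothing left to read
  }
  read.table(text = lines, sep = sep, header = FALSE)
}

first  <- read_chunk(con, 2)  # lines 1 and 2
second <- read_chunk(con, 2)  # lines 3 and 4, continuing where we stopped
third  <- read_chunk(con, 2)  # the connection is exhausted

close(con)
unlink(tmp)
```

Because the connection keeps its own position, consecutive calls return consecutive parts of the file without any bookkeeping in the helper.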

The implementation tries to deal gracefully with slightly corrupted data by dropping points that cannot be read consistently and producing a warning. However, this is not always possible, in which case an error is raised instead.

Column names

If the file has column headers in the first line, then they can be used by setting header = TRUE. Alternatively, column names can be set using col.names or a named vector for take. If no column names are specified then default names will be created.

Columns with names that start with . are considered information columns and are ignored by DSTs. See get_points() for details.

Other information columns are used by various functions.
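As a small illustration of naming columns, the sketch below reads a header-less file and marks the third column as an information column by giving it a name that starts with a dot. The file contents and the names `x` and `y` are made up for this example, and it assumes the stream package is installed.

```r
library(stream)

# Three rows without a header line; the third column holds a class label.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1.5,2.5,1", "3.5,4.5,2", "5.5,6.5,1"), tmp)

# The leading dot marks ".class" as an information column.
s <- DSD_ReadStream(tmp, col.names = c("x", "y", ".class"))
pts <- get_points(s, n = 3)
names(pts)

close_stream(s)
unlink(tmp)
```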

Reading the whole stream

By using n = -1 in get_points(), the whole stream is returned.
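A minimal sketch of reading a complete file in one call, assuming the stream package is installed. The temporary file and the choice of 50 points are arbitrary.

```r
library(stream)

# Write 50 points from a synthetic stream to a temporary file.
tmp <- tempfile(fileext = ".csv")
write_stream(DSD_Gaussians(k = 3, d = 2), tmp, n = 50, header = TRUE)

s <- DSD_ReadStream(tmp, header = TRUE)

# n = -1 asks for all points available on the connection.
all_points <- get_points(s, n = -1)
nrow(all_points)

close_stream(s)
unlink(tmp)
```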

Resetting and closing a stream

The position in the file can be reset to the beginning or another position using reset_stream(). This fails if the underlying connection is not seekable (see connection).

DSD_ReadStream maintains an open connection to the stream, which needs to be closed using close_stream().
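The sketch below shows one way to rewind and then close a file-backed stream, assuming the stream package is installed and the file is an ordinary (seekable) file on disk. The temporary file and the point counts are arbitrary.

```r
library(stream)

tmp <- tempfile(fileext = ".csv")
write_stream(DSD_Gaussians(k = 3, d = 2), tmp, n = 20, header = TRUE)

s <- DSD_ReadStream(tmp, header = TRUE)

first <- get_points(s, n = 5)  # points 1 to 5
reset_stream(s)                # rewind to the first data point
again <- get_points(s, n = 5)  # expected to be the same five points

# Release the underlying connection when done.
close_stream(s)
unlink(tmp)
```

Comparing `first` and `again` (for example with `all.equal()`) is a quick way to confirm that the reset worked.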

DSD_ReadCSV reads a stream from a comma-separated values file.

Value

An object of class DSD_ReadCSV (subclass of DSD_R, DSD).

Author(s)

Michael Hahsler

Examples

# Example 1: creating data and writing it to disk
stream <- DSD_Gaussians(k = 3, d = 2)
write_stream(stream, "data.txt", n = 100, info = TRUE, header = TRUE)
readLines("data.txt", n = 5)

# reading the same data back
stream2 <- DSD_ReadStream("data.txt", header = TRUE)
stream2

# get points
get_points(stream2, n = 5)
plot(stream2, n = 20)

# clean up
close_stream(stream2)
file.remove("data.txt")

# Example 2: read part of the kddcup1999 data (take only continuous variables)
# column 42 is the class variable
file <- system.file("examples", "kddcup10000.data.gz", package = "stream")
stream <- DSD_ReadCSV(gzfile(file),
        take = c(1, 5, 6, 8:11, 13:20, 23:41, .class = 42), k = 7)
stream

get_points(stream, 5)

# plot 100 points (projected on the first two principal components)
plot(stream, n = 100, method = "pca")

close_stream(stream)

stream documentation built on April 4, 2025, 1:02 a.m.