read.sparse: Read Sparse Matrix from Text File



Description

Read a labelled sparse CSR matrix in text format as used by libraries such as SVMLight, LibSVM, ThunderSVM, LibFM, xLearn, XGBoost, LightGBM, and more.

The format is as follows:

<label(s)> <column>:<value> <column>:<value> ...

with one line per observation/row.

Example line (row):

1 1:1.234 3:20

This line denotes a row with label (target variable) equal to 1, a value for the first column of 1.234, a value of zero for the second column (which is missing), and a value of 20 for the third column.
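As a rough illustration, the example line above can be parsed directly from a string by passing 'from_string=TRUE'. The variable name 'line' is only illustrative:

```r
library(readsparse)

# The example row: label 1, column 1 = 1.234, column 3 = 20
line <- "1 1:1.234 3:20"
r <- read.sparse(line, from_string=TRUE)

# 'X' holds the features and 'y' holds the labels
print(dim(r$X))
print(r$y)
```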

The labels might be decimal (for regression), and each row might contain more than one label (these must be integers in this case), separated by commas without spaces in between - e.g.:

1,5,10 1:1.234 3:20

This line indicates a row with labels 1, 5, and 10 (for multi-class classification). If the line has no labels, it should still include a space before the features.
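A minimal sketch of reading such a multi-label row, assuming the default 'index1=TRUE' so that label numbers start at 1. The variable name 'txt' is only illustrative:

```r
library(readsparse)

# One row with labels 1, 5, and 10
txt <- "1,5,10 1:1.234 3:20"
r <- read.sparse(txt, multilabel=TRUE, from_string=TRUE)

# With 'multilabel=TRUE', 'y' has one column per label
# instead of being a plain vector
print(class(r$y))
print(dim(r$y))
```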

The rows might additionally contain a 'qid' parameter as used in ranking algorithms, which should always lie between the labels and the features and must be an integer - e.g.:

1 qid:2 1:1.234 3:20
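Rows with a 'qid' field can be read by passing 'has_qid=TRUE'. The sketch below only inspects the structure of the returned list rather than assuming particular entry names beyond 'X' and 'y'. The second row is made up for illustration:

```r
library(readsparse)

# Two rows belonging to the same query (qid 2)
txt <- "1 qid:2 1:1.234 3:20\n0 qid:2 2:0.5"
r <- read.sparse(txt, has_qid=TRUE, from_string=TRUE)

# Show the top-level entries of the result
str(r, max.level=1)
```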

The file might optionally contain a header as the first line with metadata (number of rows, number of columns, number of classes). The presence of a header is detected automatically, and including one is recommended for speed. Datasets from the extreme classification repository (see references) usually include such a header.

Lines might include comments, which start after a '#' character. Lines consisting of only a '#' will be ignored. When reading from a file, the file might start with a BOM (byte order mark, an encoding marker commonly written on Windows systems), which will be skipped automatically.

Usage

read.sparse(
  file,
  multilabel = FALSE,
  has_qid = FALSE,
  integer_labels = FALSE,
  index1 = TRUE,
  sort_indices = TRUE,
  ignore_zeros = TRUE,
  min_cols = 0L,
  min_classes = 0L,
  limit_nrows = 0L,
  no_trailing_ws = FALSE,
  from_string = FALSE
)

Arguments

file

Either a file path from which the data will be read, or a string ('character' variable) containing the text from which the data will be read. In the latter case, 'from_string=TRUE' must be passed.

multilabel

Whether the input file can have multiple labels per observation. If passing 'multilabel=FALSE' and the file turns out to have multiple labels, only the first one will be taken for each row. If the labels are not integers or contain a decimal point, the results will be invalid.

has_qid

Whether the input file has a 'qid' field (used for ranking). If passing 'FALSE' and the file does turn out to have 'qid', the features will not be read for any observations.

integer_labels

Whether to output the observation labels as integers.

index1

Whether the input file uses numbering starting at 1 for the column numbers (and for the label numbers when passing 'multilabel=TRUE'). This is usually the case for files downloaded from the repositories in the references. The function will check whether any of the column indices is zero, and will ignore this option if so (i.e. will assume it is 'FALSE').

sort_indices

Whether to sort the indices of the columns after reading the data. These should already be sorted in the files from the repositories in the references.

ignore_zeros

Whether to avoid adding features which have a value of zero. If the zeros are caused by numerical rounding in the software that wrote the input file, they can be post-processed by passing 'ignore_zeros=FALSE' and then something like 'X@x[X@x == 0] = 1e-8'.

min_cols

Minimum number of columns that the output 'X' object should have, in case some columns are all missing in the input data.

min_classes

Minimum number of columns that the output 'y' object should have, in case some columns are all missing in the input data. Only used when passing 'multilabel=TRUE'.

limit_nrows

Maximum number of rows to read from the data. If there are more than this number of rows, it will only read the first 'limit_nrows' rows. If passing zero (the default), there will be no row limit.

no_trailing_ws

Whether to assume that lines in the file will never have extra whitespace.

from_string

Whether to read the data from a string variable instead of a file. If passing 'from_string=TRUE', then 'file' is assumed to be a variable holding the data contents.
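The sketch below combines a few of these arguments on a small made-up input. The data and variable names are illustrative assumptions, and the last step follows the post-processing suggestion given under 'ignore_zeros':

```r
library(Matrix)
library(readsparse)

# Three rows; the first one stores an explicit zero in column 1
txt <- "1 1:0 2:3.5\n0 3:1.5\n1 1:2"

r <- read.sparse(
  txt,
  ignore_zeros = FALSE,  # keep the explicit zero as a stored entry
  min_cols = 5L,         # ask for at least 5 columns in 'X'
  limit_nrows = 2L,      # read only the first two rows
  from_string = TRUE
)
print(dim(r$X))

# Replace stored zeros with a small positive value
X <- r$X
X@x[X@x == 0] <- 1e-8
print(X@x)
```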

Details

Note that this function:

Be aware that the data is represented as a CSR matrix whose index pointer has C type 'int', so the number of rows, columns, or non-zero elements cannot exceed '.Machine$integer.max'.

On Windows, if the package is installed from CRAN and compiled using the GCC compiler version 4 or earlier (the default in older versions of RTools, such as Rtools35), it will not be able to read from or write to file names with non-ASCII characters. This can be solved by installing it directly from the GitHub repository ('remotes::install_github("david-cortes/readsparse")'). Whether support for non-ASCII file names is available can be checked through readsparse_nonascii_support.

On 64-bit Windows systems, if compiling the library with a compiler other than MinGW or MSVC, it will not be able to read files larger than 2GB. This should not be a concern if installing it from CRAN or from R itself, as the Windows version at the time of writing can only be compiled with MinGW.

If the file contains a header, and this header denotes a larger number of columns or of labels than the largest index in the data, the resulting object will have this dimension set according to the header. The third entry in the header (number of classes/labels) will be ignored when passing 'multilabel=FALSE'.

The function uses different code paths when reading from a file and when reading from a string, and there might be slight differences between the results obtained from them. For example, reading from a file might produce the desired output if the file uses tabs as separators instead of spaces (not supported by most other software and not standard), whereas reading from a string will not. If any such difference is encountered, please submit a bug report on the package's GitHub page.

Value

A list with the following entries:

These can be easily transformed to other sparse matrix types through e.g. 'X <- as(X, "CsparseMatrix")'.
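For instance, a conversion along these lines might look as follows. This is a minimal sketch using the coercion methods from the 'Matrix' package, and the names 'X_csc' and 'X_coo' are illustrative:

```r
library(Matrix)
library(readsparse)

r <- read.sparse("1 2:1.21 5:2.05\n-1 1:0.45 3:0.001 4:-10", from_string=TRUE)

# Convert the features to column-compressed (CSC) format
X_csc <- as(r$X, "CsparseMatrix")

# Convert the features to triplet (COO) format
X_coo <- as(r$X, "TsparseMatrix")

print(class(r$X))
print(class(X_csc))
print(class(X_coo))
```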

References

Datasets in this format can be found here:

The format is also described at the SVMLight webpage: http://svmlight.joachims.org.

See Also

write.sparse

Examples

library(Matrix)
library(readsparse)

### Example input file
"1 2:1.21 5:2.05
-1 1:0.45 3:0.001 4:-10" -> coded.matrix

r <- read.sparse(coded.matrix, from_string=TRUE)
print(r)

### Convert it back to text
recoded.matrix <- write.sparse(file=NULL, X=r$X, y=r$y, to_string=TRUE)
cat(recoded.matrix)

### Example with real file I/O
## generate a random sparse matrix and labels
set.seed(1)
X <- rsparsematrix(nrow=5, ncol=10, nnz=8)
y <- rnorm(5)

## save into a text file
temp_file <- file.path(tempdir(), "matrix.txt")
write.sparse(temp_file, X, y, integer_labels=FALSE)

## inspect the text file
cat(paste(readLines(temp_file), collapse="\n"))

## read it back
r <- read.sparse(temp_file)
print(r)

### (Note that all-zero columns at the end are discarded;
###  this behavior can be avoided by passing 'add_header=TRUE' to write.sparse)

readsparse documentation built on Oct. 14, 2021, 9:10 a.m.