Introduction

csvsplittr provides a simple function, split_csv, to split a large CSV (or other tab-delimited) file into a specified number of files.

This is because ther software using CSVs, such as the Linguistic Inquiry and Word Count (LIWC) program for computational text analysis, or even Google Sheets and Microsoft Excel, perform poorly or not at all with very large CSV files.

This package provides a function to read a CSV file, split it into a specified number of files, and write those smaller files to the working directory.

There is one function, split_csv() which takes as inputs a file name for a CSV file and a specified number of files to split, and an optional argument specifying the delimiter if not a comma (i.e., a tab if the file is tab-delimited).

It splits the data frame output from reading the CSV file, rounding the number of rows in the data frame divided by the number of files to the ceiling, with the final split containing the rows remaining from the second to last split to the final row of the data frame output from reading the CSV File.

The split files are saved to the working directory with a number appended to the original file name to indicate their position when split.

Example

Here is a short example (not run) using a CSV file with data from Donors Choose. Let's say it is about 1 GB and we wanted to split it into four files, each about 250 MB:

library(csvsplittr)
split_csv("really_large_file.csv", 4)

Without loading the library, you can also use the function as following:

csvsplittr::split_csv("really_large_file.csv", 4)

If we check what is in the working directory, we will see the original file and the split files, i.e. donors_choose_data-1.csv, donors_choose_data-2.csv, donors_choose_data-3.csv, and donors_choose_data-4.csv, with the same headers as the original file.



jrosen48/csvsplittr documentation built on May 20, 2019, 2:06 a.m.