Write out a chunk of a larger dataset, using Hadoop-style partition=&lt;value&gt; directory nomenclature, and save it in the cache directory dataset_name. The chunk is identified by the rowid column in rows, which is attached to the dataset columns selected by cols. Columns selected by rollback_cols are rolled back one row by dataset_rollback_event, and all timestamp columns are normalized by dataset_normalize_timestamps.
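
For illustration (the cache root and dataset name here are hypothetical, not taken from the package), a chunk written with partition = 3 and dataset_name = "my_dataset" lands in a directory of the form:

&lt;cache root&gt;/my_dataset/partition=3/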

Usage

dataset_chunk_write(
  dataset,
  partition,
  dataset_name,
  rows = data.table(rowid = seq_len(nrow(dataset))),
  cols = colnames(dataset),
  ...
)

Arguments

dataset

data.frame-like dataset to load from; must contain an index column and cannot be an arrow_dplyr_query

partition

character|integer identifying the partition the chunk will be saved in

dataset_name

character cache directory in which the partition will be saved

rows

data.table identifying rows of the dataset to load; will be appended to dataset

cols

character columns of the dataset to add to the partition

...

Arguments passed on to dataset_rollback_event and dataset_normalize_timestamps

rollback_cols

character vector of columns to roll back

event

character column name of a logical feature that indicates the events to roll back

by

character column name to group the table by

timestamp_cols

character vector of columns to normalize; defaults to all columns with a name containing the word timestamp
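
Examples

A minimal sketch of a call, assuming the dataset carries a rowid index column, a logical event column, and a column whose name contains "timestamp"; all object and column names below are illustrative and not part of the package:

library(data.table)

dt <- data.table(
  rowid = 1:6,
  id = rep(c("a", "b"), each = 3),
  event_flag = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  score = rnorm(6),
  timestamp_created = Sys.time() + 0:5
)

dataset_chunk_write(
  dataset = dt,
  partition = 1,                               # saved under .../partition=1/
  dataset_name = "my_dataset",                 # cache directory (hypothetical)
  rows = dt[, .(rowid)],                       # all rows of this chunk
  cols = c("id", "event_flag", "score", "timestamp_created"),
  rollback_cols = "score",                     # passed on to dataset_rollback_event
  event = "event_flag",
  by = "id",
  timestamp_cols = "timestamp_created"         # passed on to dataset_normalize_timestamps
)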