mlr3 model for predicting a customer's first contribution
Usage
contributions_model
contributions_dataset(
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  rebuild_dataset = NULL,
  chunk_size = 1e+07,
  ...
)
# S3 method for class 'contributions_model'
read(
  model,
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  predict_since = Sys.Date() - 30,
  rebuild_dataset = NULL,
  downsample_read = 1,
  predict = NULL,
  ...
)
# S3 method for class 'contributions_model'
train(model, num_trees = 512, downsample_train = 0.1, ...)
# S3 method for class 'contributions_model'
predict(model, ...)
# S3 method for class 'contributions_model'
output(
  model,
  downsample_output = 1,
  features = NULL,
  n_repetitions = 5,
  n_top = 500,
  n_features = 25,
  ...
)
Arguments
- since
Date/POSIXct
data on or after this date will be loaded and possibly used for training
- until
Date/POSIXct
data after this date will not be used for training or predictions; defaults to the beginning of today
- rebuild_dataset
boolean
rebuild the dataset by calling contributions_dataset(since = since, until = until) (TRUE), just read the existing one (FALSE), or append new rows by calling contributions_dataset(since = max_existing_date, until = until) (NULL, the default); see the sketch after this argument list
- chunk_size
integer(1) number of rows per partition
- ...
not used
- model
contributions_model
object
- predict_since
Date/POSIXct data on/after this date will be used to make predictions and not for training
- downsample_read
numeric(1)
the amount to downsample the dataset on read
- predict
Not used, just here to prevent partial argument matching
- num_trees
integer(1)
maximum number of trees to use for the ranger model
- downsample_train
double(1)
fraction of observations to use for training, defaults to 0.1
- downsample_output
numeric(1)
the proportion of the test set to use for feature importance and Shapley explanations
- features
character
The names of the features for which to compute the feature effects/importance.
- n_repetitions
numeric(1)
How many shufflings of the features should be done? See iml::FeatureImp for more info.
- n_top
integer(1)
the number of rows, ranked by probability, to analyze and explain as 'top picks'.
- n_features
integer(1)
the number of features, ranked by importance, to analyze.
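For illustration only, the three rebuild_dataset settings map onto calls roughly as sketched below; resolve_rebuild and max_existing_date are hypothetical names used for the sketch, not package internals.

resolve_rebuild <- function(rebuild_dataset, since, until, max_existing_date) {
  if (isTRUE(rebuild_dataset)) {
    # TRUE: rebuild the whole dataset
    contributions_dataset(since = since, until = until)
  } else if (is.null(rebuild_dataset)) {
    # NULL (default): append only rows newer than what is already cached
    contributions_dataset(since = max_existing_date, until = until)
  }
  # FALSE: read the existing dataset as-is, no call to contributions_dataset()
}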
Methods (by generic)
read(contributions_model)
: Read in contribution data and prepare an mlr3 training task and a prediction/validation task
train(contributions_model)
: Tune and train a stacked log-reg/ranger model on the data
predict(contributions_model)
: Predict using the trained model
output(contributions_model)
: Create IML reports for contributions_model
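An illustrative end-to-end run, assuming contributions_model is the exported model object and that each method returns the updated model; the argument values are examples only.

model <- contributions_model
model <- read(model, since = Sys.Date() - 365 * 5, predict_since = Sys.Date() - 30)
model <- train(model, num_trees = 512, downsample_train = 0.1)
model <- predict(model)
output(model, n_top = 500, n_features = 25)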
Functions
contributions_dataset()
: Build the contributions dataset from the overall stream.
events are the first contribution per household > $50
data after an event are censored
contribution indicators are rolled back and timestamps are normalized to the start of customer activity
only data on or after since are loaded
Data is written to the primary cache, partitioned by year, and then synced across storages.
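A minimal sketch of the year-partitioned cache write using the arrow package, assuming a contributions data frame with an event_date column; the cache path and column names are hypothetical, not the package's actual internals.

library(arrow)
library(dplyr)

contributions |>
  mutate(year = as.integer(format(event_date, "%Y"))) |>   # derive the partition column
  write_dataset("cache/contributions",                     # hypothetical primary cache path
                format = "parquet",
                partitioning = "year")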
Note
Data will be loaded in memory because mlr3 doesn't work well with factors encoded as dictionaries in Arrow tables.
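For illustration, one way to materialise the cached Arrow data in memory so dictionary-encoded columns arrive as ordinary R factors before building the mlr3 task; the path, target column, and positive level are hypothetical.

library(arrow)
library(mlr3)

dat <- dplyr::collect(open_dataset("cache/contributions"))   # in-memory data.frame; dictionary columns become factors
task <- as_task_classif(dat, target = "contributed", positive = "TRUE")   # hypothetical target column/level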
Preprocessing:
ignore 1-day and email "send" features because they leak data
remove constant features
balance classes to a ratio of 1:10 T:F
Yeo-Johnson transformation with tuned boundaries
impute missing values out-of-range and add missing data indicator features
feature importance filter (for log-reg submodel only)
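A minimal sketch of these preprocessing steps as an mlr3pipelines graph, attached to a single ranger learner for brevity; the real model stacks log-reg and ranger and applies the importance filter to the log-reg submodel only, which is omitted here. The leaky-feature name patterns are hypothetical, and the Yeo-Johnson boundaries would be tuned rather than left at defaults.

library(mlr3)
library(mlr3pipelines)
library(mlr3learners)

preproc <-
  po("select", selector = selector_invert(selector_grep("_1day|email_send"))) %>>%  # drop leaky features (hypothetical name patterns)
  po("removeconstants") %>>%                                                        # drop constant features
  po("classbalancing", ratio = 10, reference = "minor", adjust = "major") %>>%      # downsample majority class to roughly 1:10 T:F
  po("yeojohnson") %>>%                                                             # boundaries tuned in the real model
  po("copy", outnum = 2) %>>%
  gunion(list(po("imputeoor"), po("missind"))) %>>%                                 # out-of-range imputation + missing-data indicators
  po("featureunion")

learner <- as_learner(
  preproc %>>% po("learner", lrn("classif.ranger", num.trees = 512, predict_type = "prob"))
)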