
mlr3 model for predicting a customer's first contribution

Usage

contributions_model

contributions_dataset(
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  rebuild_dataset = NULL,
  chunk_size = 1e+07,
  ...
)

# S3 method for class 'contributions_model'
read(
  model,
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  predict_since = Sys.Date() - 30,
  rebuild_dataset = NULL,
  downsample_read = 1,
  predict = NULL,
  ...
)

# S3 method for class 'contributions_model'
train(model, num_trees = 512, downsample_train = 0.1, ...)

# S3 method for class 'contributions_model'
predict(model, ...)

# S3 method for class 'contributions_model'
output(
  model,
  downsample_output = 1,
  features = NULL,
  n_repetitions = 5,
  n_top = 500,
  n_features = 25,
  ...
)

Format

An object of class contributions_model (inherits from mlr_report, report, list) of length 0.

Arguments

since

Date/POSIXct data on or after this date will be loaded and possibly used for training

until

Date/POSIXct data after this date will not be used for training or predictions, defaults to the beginning of today

rebuild_dataset

logical(1) or NULL; rebuild the dataset by calling contributions_dataset(since = since, until = until) (TRUE), just read the existing one (FALSE), or append new rows by calling contributions_dataset(since = max_existing_date, until = until) (NULL, the default). A small sketch of this dispatch follows the argument list.

chunk_size

integer(1) number of rows per partition

...

not used

model

contributions_model object

predict_since

Date/POSIXct data on/after this date will be used to make predictions and not for training

downsample_read

numeric(1) the proportion of the dataset to keep when reading it (1, the default, keeps everything)

predict

Not used, just here to prevent partial argument matching

num_trees

integer(1) maximum number of trees to use for the ranger model

downsample_train

double(1) fraction of observations to use for training, defaults to .1

downsample_output

numeric(1) the proportion of the test set to use for feature importance and Shapley explanations

features

character The names of the features for which to compute the feature effects/importance.

n_repetitions

numeric(1) How many shufflings of the features should be done? See iml::FeatureImp for more info.

n_top

integer(1) the number of rows, ranked by predicted probability, to analyze and explain as 'top picks'.

n_features

integer(1) the number of features, ranked by importance, to analyze.
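
For orientation, the rebuild_dataset switch described above could be read as the dispatch below; the helper name refresh_dataset and the max_existing_date value are hypothetical stand-ins, not part of the package.

# Hypothetical reading of the rebuild_dataset dispatch (not the package's code);
# max_existing_date stands in for the newest date already present in the cache.
refresh_dataset <- function(since, until, rebuild_dataset = NULL, max_existing_date) {
  if (isTRUE(rebuild_dataset)) {
    contributions_dataset(since = since, until = until)              # full rebuild
  } else if (is.null(rebuild_dataset)) {
    contributions_dataset(since = max_existing_date, until = until)  # append new rows
  }
  # rebuild_dataset = FALSE: the existing dataset is read without rebuilding
}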

Methods (by generic)

  • read(contributions_model): Read in contribution data and prepare an mlr3 training task and a prediction/validation task; a workflow sketch follows this list

  • train(contributions_model): Tune and train a stacked log-reg/ranger model on the data

  • predict(contributions_model): Predict using the trained model

  • output(contributions_model): Create iml-based reports (feature importance and Shapley explanations) for a contributions_model
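
A minimal end-to-end sketch of these generics, assuming the package that exports contributions_model is attached and that each method returns the updated model object (an assumption, not documented behavior):

m <- read(contributions_model,
          since = Sys.Date() - 365 * 5,
          predict_since = Sys.Date() - 30)
m <- train(m, num_trees = 512, downsample_train = 0.1)
m <- predict(m)
output(m, n_top = 500, n_features = 25)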

Functions

  • contributions_dataset(): Build the contributions dataset from the overall stream.

    • events are the first contribution per household > $50

    • data after an event are censored

    • contribution indicators are rolled back and timestamps are normalized to the start of customer activity

    • only data on or after since are loaded

    Data is written to the primary cache, partitioned by year, and then synced across storages. A sketch of the event and censoring logic follows.
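
In the sketch, stream and the column names household_id, amount, and ts are illustrative stand-ins for the raw contribution stream, not the actual schema:

library(dplyr)

# Toy stand-in for the raw stream.
stream <- data.frame(
  household_id = c(1, 1, 1, 2, 2),
  amount       = c(20, 60, 80, 10, 75),
  ts           = as.Date("2024-01-01") + c(0, 10, 40, 5, 30)
)

# The first contribution over $50 per household defines the event.
events <- stream |>
  filter(amount > 50) |>
  group_by(household_id) |>
  summarise(event_ts = min(ts), .groups = "drop")

# Censor rows after the event and normalize timestamps to the start of activity.
censored <- stream |>
  left_join(events, by = "household_id") |>
  filter(is.na(event_ts) | ts <= event_ts) |>
  group_by(household_id) |>
  mutate(days_since_start = as.numeric(ts - min(ts), units = "days")) |>
  ungroup()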

Note

Data will be loaded in memory because mlr3 doesn't work well with factors encoded as dictionaries in Arrow tables.
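
For illustration, collecting the cached Arrow dataset into an in-memory data.frame (at which point dictionary-encoded columns become plain R factors) might look like the following; the cache path and the ts column name are placeholders:

library(arrow)
library(dplyr)

since <- Sys.Date() - 365 * 5
until <- Sys.Date()

ds  <- open_dataset("path/to/primary-cache")    # placeholder path for the primary cache
dat <- ds |>
  filter(ts >= since, ts < until) |>            # `ts` is an assumed timestamp column
  collect()                                     # materialize in memory; dictionaries -> factors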

Preprocessing:

  • ignore 1-day and email "send" features because they leak data

  • remove constant features

  • balance classes to a 1:10 TRUE:FALSE ratio

  • Yeo-Johnson with tuned boundaries

  • impute missing values out-of-range and add missing data indicator features

  • feature importance filter (for the log-reg submodel only; see the sketches below)
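
One way this preprocessing could be written as an mlr3pipelines graph is sketched below. The feature-name pattern for the leaky columns, the fixed Yeo-Johnson bounds, and the assumption that TRUE is the rare class are illustrative, not the package's actual choices; the importance filter is shown in the model sketch further down, since it applies to the log-reg submodel only.

library(mlr3verse)   # attaches mlr3, mlr3pipelines, mlr3filters, mlr3learners, ...

prep <-
  # drop the leaky 1-day and email "send" features (name pattern is illustrative)
  po("select", selector = selector_invert(selector_grep("_1day|email_send"))) %>>%
  po("removeconstants") %>>%
  # keep the majority class at 10x the minority class, i.e. a 1:10 TRUE:FALSE ratio
  po("classbalancing", ratio = 10, reference = "minor", adjust = "major") %>>%
  # Yeo-Johnson transform; the lambda bounds would be tuned rather than fixed
  po("yeojohnson", lower = -2, upper = 2) %>>%
  # out-of-range imputation plus missing-data indicator columns
  po("copy", outnum = 2) %>>%
  gunion(list(po("missind"), po("imputeoor"))) %>>%
  po("featureunion")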

Model:

  • stacked model: log-reg and ranger base learners whose predictions feed a final log-reg

  • tuned with a hyperband method on the AUC (area under the sensitivity/specificity ROC curve); a tuning sketch follows
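
A sketch, under assumptions, of the stacking and tuning described above, using mlr3pipelines and mlr3hyperband: the cross-validated base learners, the rpart learner behind the importance filter, the tuned mtry.ratio, the 16-512 num.trees range (with num.trees doubling as the hyperband budget), and the 3-fold CV are illustrative choices, not the package's documented settings.

library(mlr3verse)      # mlr3, mlr3pipelines, mlr3filters, mlr3learners, mlr3tuning, ...
library(mlr3hyperband)  # provides tnr("hyperband")

stack <-
  gunion(list(
    # log-reg branch; the importance filter applies to this submodel only
    po("filter", filter = flt("importance", learner = lrn("classif.rpart")),
       filter.nfeat = 25) %>>%
      po("learner_cv", lrn("classif.log_reg", predict_type = "prob"), id = "log_reg_cv"),
    # ranger branch; num.trees is tagged as the hyperband budget parameter
    po("learner_cv", lrn("classif.ranger", predict_type = "prob",
                         mtry.ratio = to_tune(0.1, 0.9),
                         num.trees  = to_tune(p_int(16, 512, tags = "budget"))))
  )) %>>%
  po("featureunion") %>>%
  lrn("classif.log_reg", predict_type = "prob")   # log-reg super learner

glrn <- as_learner(stack)
glrn$predict_type <- "prob"

at <- auto_tuner(
  tuner      = tnr("hyperband"),
  learner    = glrn,
  resampling = rsmp("cv", folds = 3),
  measure    = msr("classif.auc")
)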