mlr3 model for predicting a customer's first contribution
Usage
contributions_model
contributions_dataset(
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  rebuild_dataset = NULL,
  chunk_size = 1e+07,
  ...
)
# S3 method for class 'contributions_model'
read(
  model,
  since = Sys.Date() - 365 * 5,
  until = Sys.Date(),
  predict_since = Sys.Date() - 30,
  rebuild_dataset = NULL,
  downsample_read = 1,
  predict = NULL,
  ...
)
# S3 method for class 'contributions_model'
train(model, num_trees = 512, downsample_train = 0.1, ...)
# S3 method for class 'contributions_model'
predict(model, ...)
# S3 method for class 'contributions_model'
output(
  model,
  downsample_output = 1,
  features = NULL,
  n_repetitions = 5,
  n_top = 500,
  n_features = 25,
  ...
)
Arguments
- since
Date/POSIXct
data on or after this date will be loaded and possibly used for training
- until
Date/POSIXct
data after this date will not be used for training or predictions; defaults to the beginning of today
- rebuild_dataset
boolean
rebuild the dataset by calling contributions_dataset(since = since, until = until) (TRUE), just read the existing one (FALSE), or append new rows by calling contributions_dataset(since = max_existing_date, until = until) (NULL, the default); see the sketch after this argument list
- chunk_size
integer(1) number of rows per partition
- ...
not used
- model
contributions_model
object
- predict_since
Date/POSIXct data on/after this date will be used to make predictions and not for training
- downsample_read
numeric(1)
the amount to downsample the dataset on read
- predict
Not used, just here to prevent partial argument matching
- num_trees
integer(1)
maximum number of trees to use for the ranger model
- downsample_train
double(1)
fraction of observations to use for training, defaults to 0.1
- downsample_output
numeric(1)
the proportion of the test set to use for feature importance and Shapley explanations
- features
character
The names of the features for which to compute the feature effects/importance.
- n_repetitions
numeric(1)
How many shufflings of the features should be done? See iml::FeatureImp for more info.
- n_top
integer(1)
the number of rows, ranked by probability, to analyze and explain as 'top picks'.
- n_features
integer(1)
the number of features, ranked by importance, to analyze.
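For illustration only, the three rebuild_dataset settings map onto calls roughly as sketched below; resolve_rebuild and max_existing_date are hypothetical names used for the sketch, not package internals.

resolve_rebuild <- function(rebuild_dataset, since, until, max_existing_date) {
  if (isTRUE(rebuild_dataset)) {
    # TRUE: rebuild the whole dataset
    contributions_dataset(since = since, until = until)
  } else if (is.null(rebuild_dataset)) {
    # NULL (default): append only rows newer than what is already cached
    contributions_dataset(since = max_existing_date, until = until)
  }
  # FALSE: read the existing dataset as-is, no call to contributions_dataset()
}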
Methods (by generic)
read(contributions_model)
: Read in contribution data and prepare an mlr3 training task and a prediction/validation task
train(contributions_model)
: Tune and train a stacked log-reg/ranger model on the data
predict(contributions_model)
: Predict using the trained model
output(contributions_model)
: Create IML reports for contributions_model
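An illustrative end-to-end run, assuming contributions_model is the exported model object and that each method returns the updated model; the argument values are examples only.

model <- contributions_model
model <- read(model, since = Sys.Date() - 365 * 5, predict_since = Sys.Date() - 30)
model <- train(model, num_trees = 512, downsample_train = 0.1)
model <- predict(model)
output(model, n_top = 500, n_features = 25)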
Functions
contributions_dataset()
: Build the contributions dataset from the overall stream.
events are the first contribution per household > $50
data after an event are censored
contribution indicators are rolled back and timestamps are normalized to the start of customer activity
only data on or after since are loaded
Data is written to the primary cache, partitioned by year, and then synced across storages.
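A minimal sketch of the year-partitioned cache write using the arrow package, assuming a contributions data frame with an event_date column; the cache path and column names are hypothetical, not the package's actual internals.

library(arrow)
library(dplyr)

contributions |>
  mutate(year = as.integer(format(event_date, "%Y"))) |>   # derive the partition column
  write_dataset("cache/contributions",                     # hypothetical primary cache path
                format = "parquet",
                partitioning = "year")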
Note
Data will be loaded in memory because mlr3 doesn't work well with factors encoded as dictionaries in Arrow tables.
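For illustration, one way to materialise the cached Arrow data in memory so dictionary-encoded columns arrive as ordinary R factors before building the mlr3 task; the path, target column, and positive level are hypothetical.

library(arrow)
library(mlr3)

dat <- dplyr::collect(open_dataset("cache/contributions"))   # in-memory data.frame; dictionary columns become factors
task <- as_task_classif(dat, target = "contributed", positive = "TRUE")   # hypothetical target column/level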
Preprocessing:
ignore 1-day and email "send" features because they leak data
remove constant features
balance classes to a ratio of 1:10 T:F
Yeo-Johnson transformation with tuned boundaries
impute missing values out-of-range and add missing data indicator features
feature importance filter (for log-reg submodel only)
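A minimal sketch of these preprocessing steps as an mlr3pipelines graph, attached to a single ranger learner for brevity; the real model stacks log-reg and ranger and applies the importance filter to the log-reg submodel only, which is omitted here. The leaky-feature name patterns are hypothetical, and the Yeo-Johnson boundaries would be tuned rather than left at defaults.

library(mlr3)
library(mlr3pipelines)
library(mlr3learners)

preproc <-
  po("select", selector = selector_invert(selector_grep("_1day|email_send"))) %>>%  # drop leaky features (hypothetical name patterns)
  po("removeconstants") %>>%                                                        # drop constant features
  po("classbalancing", ratio = 10, reference = "minor", adjust = "major") %>>%      # downsample majority class to roughly 1:10 T:F
  po("yeojohnson") %>>%                                                             # boundaries tuned in the real model
  po("copy", outnum = 2) %>>%
  gunion(list(po("imputeoor"), po("missind"))) %>>%                                 # out-of-range imputation + missing-data indicators
  po("featureunion")

learner <- as_learner(
  preproc %>>% po("learner", lrn("classif.ranger", num.trees = 512, predict_type = "prob"))
)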