
Create a dataset of duplicates derived from matching names, email addresses, postal addresses, and phone numbers. The match probability for each probable duplicate is determined using the fastLink package.

Usage

duplicates_stream(...)

duplicates_append_data(
  data,
  features = rlang::exprs(`In household` = in_household, `Current membership` =
    !is.na(memb_level), `Last login date` = last_login_dt, `Last activity date` =
    last_activity_dt, `Last update date` = last_update_dt)
)

duplicates_suppress_related(data)

duplicates_exact_match(data, match_cols)

Arguments

...

Arguments passed on to tessilake::read_tessi

freshness

the returned data will be at least this fresh

incremental

whether or not to load data incrementally, default is TRUE

data

data.table to deduplicate

features

sorted, named list of expressions to use for selecting which customer to keep. Note: this is not ready to be exposed; at the moment the appended data is very limited.

match_cols

columns to match on

Value

  • data.table of duplicates

  • data.table

  • data.table of matching customers with columns customer_no and i.customer_no

  • data.table of data for deduplication

Details

This system is based on the fastLink package, a "Fast Probabilistic Record Linkage" library developed by Ted Enamorado, Ben Fifield, and Kosuke Imai.

fastLink is designed to find matches between two datasets across multiple variables, and was originally developed for linking political and social science datasets. For deduplication, duplicates_stream compares the data against itself: links will either be trivial (A = A) or they will be duplicates.

fastLink identifies matches in a way that is sensitive to the structure of the data it is given: it assigns to each match pattern a probability that it is a match. A match pattern is a unique way that the variables match or don’t match: for example, two records that match on first name, partially match on last name, one is missing data for email, etc. The algorithm tabulates counts of every occurring combination of match / partial-match / non-match / missing data across every variable for every combination of two records. It then uses a Bayesian expectation-maximization algorithm to determine the probability that each combination of matches / non-matches / etc. identifies a proper match, using the overall expected match rate as a prior. Then, given a probability threshold, it outputs the identified matches.
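As a minimal sketch of that workflow (hypothetical column names and data; for deduplication the same table is passed as both dfA and dfB):

```r
library(fastLink)

# Hypothetical customer table; in practice this comes from Tessitura.
customers <- data.frame(
  fname = c("Jon", "John", "Mary"),
  lname = c("Smith", "Smith", "Jones"),
  email = c("jon@example.com", "jon@example.com", NA)
)

# Match the table against itself. String-distance matching on names
# allows partial matches (e.g. "Jon" vs. "John"); email must agree exactly.
fl <- fastLink(
  dfA = customers, dfB = customers,
  varnames = c("fname", "lname", "email"),
  stringdist.match = c("fname", "lname"),
  partial.match = c("fname", "lname"),
  threshold.match = 0.85  # posterior match-probability cutoff
)

# fl$matches holds the matched row indices. Keeping only pairs with
# inds.a < inds.b drops the trivial self-links (A = A), leaving
# candidate duplicate pairs.
pairs <- data.frame(a = fl$matches$inds.a, b = fl$matches$inds.b)
pairs <- pairs[pairs$a < pairs$b, ]
```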

duplicates_stream uses this system as-is, with some updates for Tessi-specific deduplication:

  • it pulls multiple rows of data per customer from Tessitura in order to find hidden matches

  • it cleans the data, removing missing/incomplete information and suppressing certain data that produces many false positives (school addresses and @bam.org emails)

  • it is trained based on successfully-merged records in the database

  • it uses blocking on first name, plus a second pass of simple email address matching, to reduce the overall computation time and catch more duplicates

  • it caches datasets to disk to reduce memory load

  • it provides reporting that includes the match pattern and posterior probability for each pair
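The blocking step above can be sketched with fastLink's own blockData helper (hypothetical column names; block element names follow fastLink's block.1, block.2, ... convention):

```r
library(fastLink)

customers <- data.frame(
  fname = c("Jon", "John", "Mary", "Maria"),
  lname = c("Smith", "Smyth", "Jones", "Jonas")
)

# Block on first name so that record comparisons only happen within a
# block, shrinking the otherwise quadratic comparison space.
blocks <- blockData(customers, customers, varnames = "fname")

# Each block is then linked independently, e.g. for the first block:
# fl <- fastLink(customers[blocks$block.1$dfA.inds, ],
#                customers[blocks$block.1$dfB.inds, ],
#                varnames = c("fname", "lname"))
```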

Functions

  • duplicates_append_data(): Chooses which of each duplicate pair to keep or delete based on appended data and returns the reason for the choice.

  • duplicates_suppress_related(): Suppresses duplicates that are related by household or that have a relationship (in Tessi-speak: an association) with each other.

  • duplicates_exact_match(): Returns a minimal set of customer pairs matching exactly on match_cols — minimal in the sense that each duplicate pairing is limited to the two closest customer numbers in a duplicate cluster, and each pair is listed only once.
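A minimal sketch of this kind of exact self-join in data.table (hypothetical data; the real function's cluster handling is more involved):

```r
library(data.table)

customers <- data.table(
  customer_no = c(1L, 2L, 3L, 4L),
  email = c("a@x.org", "a@x.org", "a@x.org", "b@x.org")
)

match_cols <- "email"

# Self-join on match_cols; in j, columns from the joining table are
# accessed with the "i." prefix, giving customer_no / i.customer_no pairs.
pairs <- customers[customers, on = match_cols, allow.cartesian = TRUE,
                   .(customer_no = x.customer_no, i.customer_no)]

# Drop trivial self-matches and list each pair only once.
pairs <- pairs[customer_no < i.customer_no]

# Reduce each cluster: keep only the closest larger customer_no per customer.
pairs <- pairs[, .(i.customer_no = min(i.customer_no)), by = customer_no]
```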

  • duplicates_data(): Load data for duplicates_stream

Note

Depends on address_stream.