Predict user behavior with Propensity Scores in RudderStack Profiles.
14 minute read
Using Profile’s Propensity Scores Data App, you can predict the likelihood of user actions using machine learning (ML) algorithms. These predictive capabilities enable data-driven decision-making by calculating scores that represent the probability of a user performing a specific action within a predefined timeframe, for example:
Is a customer likely to churn in the next 30 days?
Will a user make a purchase in the next 14 days?
Is a lead likely to convert in the next 7 days?
Use cases
Reduced churn: Identify users at risk of churning and implement targeted interventions to retain them.
Increased conversions: Prioritize leads with a higher propensity to convert, boosting your marketing campaign effectiveness.
Improved resource allocation: Focus resources on high-value user segments for maximized impact.
Prerequisites
An active RudderStack Profiles project (v0.18.0 or above) using a Snowflake, BigQuery, or Redshift warehouse.
(Optional) If you are using Snowflake, you might need to create a Snowpark-optimized warehouse if your dataset is significantly large.
Install the profiles-mlcorelib library in your Python environment (version 3.9.0 to v3.11.10) using pip install profiles-mlcorelib. Note that it should be the same Python environment as the profiles-rudderstack library.
Update your pb_project.yaml file to include the library:
python_requirements:- profiles_mlcorelib>=0.6.0
Project setup
The following steps guide you on how to set up a propensity model to generate propensity scores within an existing Profiles project:
Step 1: Define the label (prediction target)
Identify and define the action you want to predict (also known as label), for example, inactivity churn, subscription churn, a lead conversion, payer conversion, reactivation, etc.
Let’s assume you want to predict whether a user will pay for your product. You can create a label named is_payer which takes the value as true for users who have paid and false for those who haven’t.
A sample entity_var definition of is_payer label will be:
entity_var:name:is_payerselect:case when user.revenue > 0 then 1 else 0 end
Label data types
For propensity modeling, the data type must be Boolean or Binary, that is, the label must have only two distinct values like 0/1, true/false, yes/no, etc.
Although not the primary focus on the Propensity model, it can also be used to predict numeric values. In that case the label is a continuous numeric value.
This can be used, for example, for predictions like predicted LTV or predicted purchase total in the next 30 days.
Step 2: Define the relevant entity_vars
You need to specify the user set for model training. For example, you can define following entity_vars that may predict whether a user is likely to spend anything in future:
Some entity_vars may not directly depend on the label and you may not want to train the model on all the users. Define the entity_vars for an eligible user set. For example, to predict a user’s likelihood of paying for a product, define the following features:
New users who created their account within the past 30 days.
Ensure that the entity_vars for features and label originate from input tables with a defined occured_at_col value.
This enables Profiles to correctly materialize past data while considering specific timeframes. Without this, you might get overly optimistic models with misleading metrics.
Step 3: Set the prediction window
Define the time frame within which you want to predict the user behavior. The predict_window_days setting determines this period, for example, you might want to predict churn within the next 30 days.
The value for predict_window_days is use case-dependent. For instance, a daily-use game might benefit from a 7-day churn prediction window, while a monthly subscription service might require a 3-month window.
A sample profiles.yaml with propensity model type and predict_window_days setting:
models:- name:payer_propensity_modelmodel_type:propensitymodel_spec:inputs:- entity/user/days_since_account_creation- entity/user/days_since_last_seen- entity/user/revenue- entity/user/is_payer- entity/user/country- entity/user/n_sessionstraining:predict_var:entity/user/is_payer label_value:1predict_window_days:30eligible_users:days_since_account_creation <= 30 and country = 'US' and revenue = 0
Note that:
All the features and label used by the model (defined as entity_vars), are provided as a list in inputs.
RudderStack appends the value of eligible_users key to an SQL query, forming a query like select * from user_var_table where <eligible_users>. Hence, you must define all the columns used in eligible_users key as entity_vars.
Step 4: Name the predictive features
Specify the features you want to predict within the prediction block. Here’s an example for predicting the payer propensity:
prediction:output_columns:percentile:name:payer_propensity_percentiledescription:Percentile score of a user's likelihood to pay in the next 30 daysscore:name:payer_propensity_probabilitydescription:Probability score of a user's likelihood to pay in the next 30 days
Sample YAML for a predictive feature
After completing the above steps, a complete sample profiles.yaml file for a predictive feature will look as follows:
models:- name:payer_propensity_modelmodel_type:propensitymodel_spec:entity_key:usertraining:predict_var:entity/user/is_payerlabel_value:1predict_window_days:30validity:month type:classification eligible_users:days_since_account_creation <= 30 and country = 'US' and revenue = 0max_row_count:50000recall_to_precision_importance:1.0new_materialisations_config:strategy:autofeature_data_min_date_diff:14max_no_of_dates:3dates:- '2024-01-01,2024-01-08'# (feature_date, label_date)- '2024-02-01,2024-02-08'- '2024-03-01,2024-03-08'ignore_features:- country prediction:output_columns:percentile:name:payer_propensity_percentiledescription:Percentile score of a user's likelihood to pay in the next 30 daysscore:name:payer_propensity_probabilitydescription:Probability score of a user's likelihood to pay in the next 30 daysis_feature:Falseeligible_users:'*'# (Optional) Defaults to what's in training.inputs:- entity/user/days_since_account_creation- entity/user/days_since_last_seen- entity/user/revenue- entity/user/is_payer- entity/user/country- entity/user/n_sessions
You can setup multiple predictive features within the same Profiles project by repeating the whole block for each predictive feature.
The following table explains the fields used in the above file:
Field
Description
name Required
Name of the model.
model_type Required
Model type. Set this to propensity.
model_spec Required
Detailed configuration specification for the model.
entity_key Required
Entity to be used.
training Required
Configuration used for training.
predict_var Required
entity_var for which you want make predictions.
label_value Required
Value of label column for which prediction needs to be generated.
predict_window_days Required
Time period within which you want to predict the user behavior.
validity Required
Time period to re-train the model. Allowed values are: day, week, month.
Re-training helps keep the model up-to-date with changing user behavior. But it comes with the cost of increased compute time and resource usage. So its preferable to keep the validity longer (for example, month) if you expect the user behavior doesn’t change too frequently.
type
Tells the model whether you are trying to predict a boolean feature (for example, yes/no, churned/not_churned, converted/not_converted, etc.), or a numeric feature (for example, amount_spent). If it is a boolean feature, the type should be classification. If it’s numeric, type should be regression. It defaults to classification if not specified. So, you can skip this parameter if you are predicting churn, subscription, etc.
eligible_users
User set for model training. RudderStack appends the value of eligible_users key to an SQL query, forming a query like select * from user_var_table where <eligible_users>. Hence, you must define all the columns used in eligible_users key as entity_vars.
If not provided, it defaults to users who have predict_var != label_value during feature generation period, that is, it generates propensity scores for all the users who have not yet converted in the feature generation period.
max_row_count
Maximum samples to be used for ML training. However, the actual number of used samples depend on the availability. Defaults to 500,000.
recall_to_precision_importance
An advanced feature that adjusts the balance between false positives and false negatives in the model’s output. If reducing false positives is more important, set the value closer to 0 (for example, 0.3).
If reducing false negatives is the priority, use a value greater than 1 (for example, 3.0). False positives are costly when it leads to unnecessary actions, such as sending a promo code to loyal users incorrectly identified as at risk of churning. On the other hand, false negatives are more problematic when missing out on targeting someone could lead to a permanent loss, such as a subscription customer who churns and never returns.
If you’re unsure about which to prioritize, you can leave the value at the default of 1.0 (or simply omit the parameter, and the model will automatically set it to 1.0).
new_materialisations_config
This block lets you configure the training data for model in case there is not enough data. See FAQ for more information.
ignore_features
List of entity_var names that are a part of the inputs but should be ignored by the model. This usually happens when your input is a feature_table model, or an SQL model where there are too many defined features for purposes unrelated to the predictive features.
prediction Required
Configuration used for prediction. It mainly lists the names of output columns and eligible users for whom predictions are to be generated.
output_columns Required
Configuration for output columns to be generated.
percentile Required
Configuration of the column in output table having percentile score.
name Required
Name of the column in output table having percentile score.
description Required
Custom description for the percentile column.
score Required
Configuration of the column in output table having probabilistic score.
name Required
Name of the column in output table having probabilistic score.
description Required
Custom description for the score column.
is_feature Required
If set to False, this feature won’t be available in final C360 table. Defaults to True.
inputs Required
List of input models and entity_vars used by the model.
You must include the label column and vars used in eligible_users definition here along with the list of features that the model uses to train.
Step 5: Run your project
After configuring your project, you can run it using one of the following methods:
Using Profile CLI
If you have created your Profiles project locally, run it using the pb run CLI command to generate the output tables.
Once your project run is complete, you can view the following outputs:
Training output
The model generates several outputs based on its training parameters, including charts and a JSON file with metrics. RudderStack stores these artifacts in the outputs folder outputs/seq_no/<seq_no>/run/Material_<model_name><hash>_<seq_no>_reports, for example:
Feature Importance Chart: Highlights the key features influencing predictions.
Cumulative Gain Chart: Provides a visual comparison between the model’s performance and random selection, helping to quantify the model’s ability to identify positive instances efficiently. This chart is particularly useful for optimizing resource allocation in scenarios like marketing campaigns or customer targeting.
Precision-Recall Curve: Assesses the trade-off between precision and recall at various probability thresholds. It is particularly useful for imbalanced datasets, highlighting the model’s ability to identify positive instances while minimizing false positives.
ROC Curve: Assesses the trade-off between true positives and false positives. It helps highlight the model’s ability to separate positive and negative labels.
Additionally, RudderStack writes the metrics from the training_summary.json file to a new row in the TRAINING_METRICS_v4 table in your warehouse.
Prediction output
Based on the prediction parameters, Profiles creates a new table in your warehouse. The table’s name is the same as the model along with a hash and seqence number (for example, Material_payer_propensity_30_days_4dd846le_21).
The table contains the following columns, along with the user_main_id:
Probability score: A value between 0 and 1 indicating the likelihood of a user action, like making a purchase.
Percentile score: A scaled version of the probability score useful for targeting specific user segments, for example, the top 10% for a campaign.
Boolean flag: A true/false indicator of whether a user is likely to perform the action, based on the model’s confidence.
The above columns are essentially the same, only represented in different forms. They represent the same prediction in different formats, catering to various use-cases/levels of granularity and interpretation preferences.
View output
You can either:
View the output materials in your warehouse, OR
If your Profiles project is in the RudderStack dashboard:
Check the predicted value for any given user in the Profile Lookup section.
Explore predictive features in the Entities tab of your Profiles project.
Click Predictive features to see the following view:
Use existing feature setup
You can even use an existing feature table set up outside of Profiles (say, any dbt or SQL project) to generate propensity scores. To do so, you need to provide the existing features as inputs via SQL models. The SQL model serves as a wrapper for your feature table.
Note the following specifications for using SQL models:
Define the feature table where original features are defined as an input model. Ensure that the is_event_stream: true, and occurred_at_col is specified. This enables creating feature snapshots at different points in the past. To support this, the feature table should retain the output from old runs as well. As your external pipeline creates new feature table snapshots daily, they should not replace the old data. Ideally, a fresh run should append new data to the same table. But if each job creates a new table, you can union all such tables to create a view. Make sure you have a timestamp column that denotes when the job was run.
The output of the SQL model should have a single row per user.
Ensure that column names in the SQL model are unique even if the SQL models are combined with other input types, such as entity_var.
The SQL model table should already include the entity’s main id column, for example, user_main_id.
Here’s an example to configure SQL models:
# Define the dbt generated feature table in inputs.yamlinputs:- name:rsFeatureTablecontract:is_optional:falseis_event_stream:truewith_entity_ids:- userwith_columns:- name:timestamp- name:user_idapp_defaults:table:user_featuresoccurred_at_col:timestampids:- select:"user_id"type:user_identity:user# Define an id stitcher over this, so the table gets a user_main_id column attachedmodels:- name:rs_id_stitchermodel_type:id_stitchermodel_spec:entity_key:useredge_sources:- from:inputs/rsFeatureTable# Define an SQL model on top of thismodels:- name:rsFeatureTableSnapshotmodel_type:sql_templatemodel_spec:single_sql:| {% with feature_table = this.DeRef("inputs/rsFeatureTable/var_table") entity_id = this.DeRef("inputs/rsTracks/rsFeatureTable/user_main_id") %}
select a.{{entity_id}}, f1, f2, f3 from
(
select * from {{feature_table}} where
timestamp <= '{{end_time.Format(\"2006-01-02 15:04:05\")}}'
) a inner join
(select {{entity_id}}, max(timestamp) as max_timestamp from {{feature_table}} where
timestamp <= '{{end_time.Format(\"2006-01-02 15:04:05\")}}'
group by {{entity_id}}) b on
a.{{entity_id}} = b.{{entity_id}} and a.timestamp = b.max_timestamp
{% endwith %}ids:- select:"user_id"type:user_identity:userto_default_stitcher:truecontract:is_optional:falseis_event_stream:falsewith_entity_ids:- userwith_columns:- name:user_id
FAQ
How does the model select data for training?
For training the ML models, RudderStack needs a minimum of 5000 samples. If the model finds that there fewer than 5000 samples, RudderStack materializes more training data by running the Profiles project at different timestamps in the past. There are two materialization strategies: auto or manual.
auto strategy
The strategy parameter in the new_materialisations_config block specifies the strategy for generating new materials. It is set to auto by default, where the model creates upto 3 pairs of materials (total 6) where each pair is 2 weeks (14 days) apart by default. RudderStack starts with the most recent data, and keeps going back 14 days, till there is enough training data (5000 samples).
Additionally, you can use the below parameters to adjust the default behavior:
max_no_of_dates: Specifies the number of materials to be generated. The default value is 3 pairs.
feature_data_min_date_diff: Specifies the minimum number of days between newly generated materials and existing materials. The default value is 2 weeks (14 days).
manual strategy
In this case, you can mention the dates from where the training pairs should be generated using the following parameter:
dates: Uses exact dates to generate the feature and label pairs.
Which features should I use as inputs to the model?
RudderStack recommends using features that help you build an effective model and solve your business problem. The commonly used features can be the ones that reflect user behavior and characteristics that influence the outcome you’re predicting. For instance:
For subscription churn, relevant features may include how long a user has been subscribed and how often they interact with key product features.
For lead scoring, factors like UTM source, days since sign-up, and engagement with pricing pages may be more relevant.
How can I know if the trained model is good?
Although there is no universal good score for model evaluation metrics, RudderStack uses a few guidelines:
Feature Importance: The top features in the feature importance chart should align with your expectations. Unexpected features or missing important ones might indicate an issue.
Cumulative Gain Chart: If the model curve is close to the baseline, the model is likely underperforming.
ROC-AUC: A score close to 0.5 indicates random performance, while scores closer to 1 indicate better model performance. However, highly imbalanced datasets may yield misleadingly high ROC-AUC scores, so precision should also be considered.
Train-Val-Test Metrics: Training metrics are usually higher than validation/test metrics. A large gap suggests overfitting. Validation and test metrics should be close to each other.
This site uses cookies to improve your experience while you navigate through the website. Out of
these
cookies, the cookies that are categorized as necessary are stored on your browser as they are as
essential
for the working of basic functionalities of the website. We also use third-party cookies that
help
us
analyze and understand how you use this website. These cookies will be stored in your browser
only
with
your
consent. You also have the option to opt-out of these cookies. But opting out of some of these
cookies
may
have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This
category only includes cookies that ensures basic functionalities and security
features of the website. These cookies do not store any personal information.
This site uses cookies to improve your experience. If you want to
learn more about cookies and why we use them, visit our cookie
policy. We'll assume you're ok with this, but you can opt-out if you wish Cookie Settings.