Additional concepts related to Profiles like packages, best practices, partial feature tables, etc.
19 minute read
This guide explains some of the advanced concepts related to Profiles.
Packages
Profiles gives you the flexibility to utilize models from existing library projects while defining your own models and inputs within the PB project. This approach allows for a seamless integration of library of pre-existing features, which are readily available and can be applied directly to data streamed into your warehouse.
In the absence of any explicitly defined models, the PB project is capable of compiling and running models from the library package given that inputs are present in the warehouse as assumed in the lib package.
Packages currently work only on Snowflake.
The following list of packages are currently available in Profiles. You can contact the RudderStack team to access these:
Generally, there will be some deviations in terms of the database name and schema name of input models - however, you can easily handle this by remapping inputs.
A sample pb_project.yaml file may look as follows:
In this case, the PB project imports a single package. It does not require a separate models folder or entities as the input and output models will be sourced from the imported packages.
Note that:
If non-mandatory inputs required by the model are not present in the warehouse, you can still run the model.
If there is any deviation in the table/view name for input models, that is, if the inputs assumed in library package are present under some other name, make sure to do the remapping.
If some of the assumed inputs are not present at all, they should be remapped to nil. This way you can create and run imported packages with minimal set of inputs present.
For example, to import a library package with the name of shopify_features:
Note that the name of the table/view is changed to the appropriate name in your warehouse. If tables are present with the same name (including database name and schema name) then no remapping is required.
Modify ID types
Extend existing package
You can add custom ID types to the default list or modify an existing one by extending the package to include your specifications.
For the corresponding id_type, add the key extends: followed by name of the same/different id_type that you wish to extend and the filters with include/exclude values.
Enlists the type of identifiers to be used for creating ID stitcher/entity_var/input_var. For example, you can define anonymous IDs that do not include the value undefined or email addresses in proper format.
Field
Data type
Description
name
String
Name of the ID type like email, user ID, etc.
extends
List
(Optional) Name of the ID type you wish to extend.
filters
List
Filter(s) the ID types to include/exclude specific values. The filters are processed in the defined order.
filters
Field
Data type
Description
type
String
Type of filter. Allowed values are include or exclude.
value
String
Value to match, for example, you can reject certain invalid ID values like NaN, unknown, test@domain.com, etc.
regex
String
Regular expression with which to match the values.
sql
List
SQL statement with select and from keys.
Custom list of ID types
To have custom list of ID types other than the provisions in the default package, you can remove and add your list as follows:
Make sure that the ID types are also defined in the entity definition.
Model Contracts
You can use the contract field to specify the constraints your model should adhere to while using the warehouse data.
Suppose a model (M1) is dependent on model (M2). Now, M1 can specify a contract defining the columns and entities that it needs from M2 to be executed successfully. Also, it becomes mandatory for M2 to provide the required columns and entities for contract validation.
Example 1
The following inputs.yaml file defines a contract:
The input table (rsIdentifies) must exist in the warehouse.
The model is an event stream model where every row in the table must be an event.
There must be a column in the inputs table (rsIdentifies) which represents the user identifier for user entity.
The input table (rsIdentifies) must have the timestamp, user_id, and anonymous_id columns.
Example 2
Let’s consider a SQL model, rsSessionTable which takes shopify_session_features as an input:
models:- name:rsSessionTablemodel_type:sql_templatemodel_spec:...# model specificationssingle_sql:| {% set contract = BuildContract('{"with_columns":[{"name":"user_id"}, {"name":"anonymous_id"}]}') %}
{% with SessionFeature = this.DeRef("models/shopify_session_features",contract)%}
select user_id as id1, anonymous_id as id2 from {{SessionFeature}}contract:with_entity_ids:- userwith_columns:- name:user_idtype:string- name:anonymous_idtype:string
There are two contracts defined in the above example:
Contract for shopify_session_features model which dictates that the user_id, and anonymous_id columns must be present.
Contract for rsSessionTable model which dictates that it must have a column representing the user identifier for user entity. Also, the user_id, and anonymous_id columns must be present.
This helps in improving the data quality, error handling, and enables static and dynamic validation of the project.
Partial feature tables
Partial feature tables are created when only a few input sources are available.
For example, lets say that you import a library package and some of the input models assumed in the package are not present in your warehouse.
When you remap some of these input models to nil, those inputs and the features directly or indirectly dependent upon those inputs are disabled. In such cases, a partial feature table is created from the rest of the available inputs. Similarly, ID stitcher also runs even if few of the edge sources are not present in the warehouse or remapped to nil.
Pre and post hooks
A pre hook enables you to execute an SQL before running a model, for example, if you want to change DB access, create a DB object, etc. Likewise, a post hook enables you to execute an SQL after running a model. The SQL can also be templatized. Here’s an example code snippet:
models:- name:test_id_stitchermodel_type:id_stitcherhooks:pre_run:"CREATE OR REPLACE VIEW {{warehouse.ObjRef('V1')}} AS (SELECT * from {{warehouse.ObjRef('Temp_tbl_a')}});"post_run:'CREATE OR REPLACE VIEW {{warehouse.ObjRef("V2")}} AS (SELECT * from {{warehouse.ObjRef("Temp_tbl_a")}});'model_spec:- # rest of model specs go here
Use Amazon S3 bucket as input
This is an experimental feature.
If you store data in your Amazon S3 bucket in a CSV file format, you can use it as an input for the Profiles models. The S3 URI path must be specified in the app_defaults.s3:
To escape comma (,) from any cell of the CSV file, enclose that cell with double quotes " " .
Double quotes (" ") enclosing a cell are ignored.
Follow the below steps to grant PB the required permissions to access the file in S3 Bucket:
Private S3 bucket
Add region, access key id, secret access key, and session token in your siteconfig file so that PB can access the private bucket. By default, the region is set to us-east-1 unless specified otherwise.
In Redshift, you additionally need to set an IAM role as default for your cluster, unless access keys are provided.
It is necessary because more than one IAM role can be associated with the cluster, and Redshift needs explicit
permission granted through an IAM role to access the S3 bucket (Public or Private).
An input file (models/inputs.yaml) contains details of input sources such as tables, views, or CSV files along with column name and SQL expression for retrieving values.
You can read data from a CSV file by using csv: <path_to_filename> under app_defaults in the input specs. CSV data is loaded internally as a single SQL select query, making it useful for seeding tests.
A sample code is as shown:
app_defaults:csv:"../common.xtra/Temp_tbl_a.csv"# remaining syntax is same for all input sources
RudderStack does not support CSV files with more than a few hundred rows.
Filter data
You can filter out any data by using the filters field in your projects file:
For example, if you want to exclude all the blacklisted email addresses, you can create an input model (for example, csv_email_blacklist) with CSV file as a source, that contains all such email addresses:
Another example, if you want to exclude all the user_ids, you can create an SQL model (for example, sql_exclusion_model) that contains a specific logic to enlist all such IDs:
For private repos, RudderStack only supports SSH Git URLs. You need to add credentials to the siteconfig.yaml and public ssh key manually to the platforms. See Use private Git repos via CLI.
The URL scheme doesn’t depend on individual Git provider host. You can use the below-mentioned Git URLs:
RudderStack supports any subfolder in git project without .git extension.
Best schema version (tags)
It is recommended to use git-tags instead of the latest commit on main branch of your library projects. Also, you can use a specific tag, for example: https://github.com/org-name/lib-name/tag/schema_<n>.
If you want Profile Builder to figure out the best schema version for every run, you can use the placeholder {{best_schema_version}}, for example, https://github.com/org-name/lib-name/tag/schema_{{best_schema_version}}. The selection of compatible git tags is done by PB, that is, it will figure out the best compatible version for the lib package.
Using this will make Profiles use the best compatible version of the library project in case of any schema updates.
You don’t have to replace the placeholder {{best_schema_version}}. For instance, if https://github.com/org-name/lib-names/tags/ has a tag for schema_44, then https://github.com/org-name/lib-names/tag/schema_44 will be automatically used. In any case, if you replace the placeholder with actual tag name, the project will work without any issues.
View model dependencies
You can create a DAG to see all the model dependencies, that is, how a model is dependent on other models by using any one of the following commands:
pb show dataflow OR pb show dependencies
Further, you can use the pb show models command to view information about the models in your project. See show command for more information.
Multi-version support
This feature is supported for the Profiles versions 0.10.8, 0.11.5, and 0.12.0 onwards.
You can constraint your Profiles project to run only on specific version(s) by specifying it in the pb_project.yaml file, under python_requirements key. For example, use the below snippet to run your project on v0.10.8:
Use the below snippet to stay on any minor version between 0.12.0 and 0.13.0. If a new minor version is released, your project will be auto-migrated to that version:
If you do not specify any version in pb_project.yaml, the latest Profiles version is used by default. The version constraints follow the same syntax as those of Python dependency specifiers.
Make sure that the version of Profiles project is the same in your environment and the pb_project.yaml file. Otherwise, RudderStack will throw an error.
Reuse models output
This is an experimental feature.
You can define the time_grain parameter for a model to ensure the model runs only once in that time period. It lets you reuse the output material of that model within the time period, preventing unnecessary recalculations.
For example, if you set the time_grain value for an ID stitcher model to a day and run the feature table model (based on the ID stitcher) multiple times during a day, the feature table model will reuse the ID stitcher’s output to compute the features. This will save a large amount of time and computations whenever you run the feature table model within that day.
Similarly, you can choose to run a feature such as weekly_spends only once a week, or last_active_day on a daily basis.
Setting time_grain parameter does not mean that the model will run automatically at the specified time period. It just ensures that the model runs only once during that time period and its outputs are reused within that duration. To schedule your project run, you must use the RudderStack dashboard.
For example, if time_grain for a model is set to a week, its outputs will be reused throughout the week and running it twice within the week won’t change its results.
You can define the time_grain parameter in the profiles.yaml file of your project:
In the above example, suppose you schedule your Profiles project to run every hour. The output for user_id_graph model will be computed every hour, the output for user_daily_vars will be computed once a day, and the output for user_weekly_vars will be computed once a week.
For a default ID stitcher model, you can define the time_grain value in the entities section as shown below:
You can set the following values for the time_grain field:
tick: Considers all the data up to the current moment (default value).
10minutes: Considers data up to the last 10-minute interval.
hour: Considers data up to the end of the previous hour.
day: Considers data up to the end of the previous day.
week: Considers data up to the end of the previous week.
month: Considers data up to the end of the previous month.
year: Considers data up to the end of the previous year.
Enable/disable model run
When you have various interdependent models, you might want to run only the required ones for a specific output.
RudderStack provides the enable_status parameter which lets you specify whether to run a model or not. Using it, you can exclude the unnecessary models from the execution process. You can assign the following values to the enable_status field in your pb_project.yaml file:
good_to_have (default): It will not execute or cause a failure when it is not possible to execute the model.
must_have: It will cause a failure when is not possible to execute the model.
not_needed: The model gets disabled if it has no dependency on the final output. It is the default value for a default ID stitcher model and ensures that the model is not executed if it is not required.
disabled: The model gets disabled and is not executed.
A sample pb_project.yaml file with enable_status parameter:
Consider a scenario where an ID stitcher model (ids) is dependent on tbl_a and tbl_b, and a feature table model (ft1) depends on the ID Stitcher (ids) and an input model tbl_a.
If the ID stitcher and feature table models are marked as not_needed, there is no need to execute either of them. However, if the feature table is marked as good_to_have/must_have, then all the models must run to create final output.
If tbl_a is set to disabled, the ID stitcher and feature table will not run. If either of them is marked as must_have, the project will run into an error.
Window functions
A window function operates on a window (group) of related rows. It performs calculation on a subset of table rows that are connected to the current row in some way. The window function has the ability to access more than just the current row in the query result.
The window function returns one output row for each input row. The values returned are calculated by using values from the sets of rows in that window. A window is defined using a window specification, and is based on three main concepts:
Window partitioning, which forms the groups of rows (PARTITION BY clause)
Window ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)
Window frames, which are defined relative to each row to further restrict the set of rows (ROWS specification). It is also known as the frame clause.
Snowflake does not enforces users to define the cumulative or sliding frames, and considers ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as the default cumulative window frame. However, you can override this by defining the frame manually.
On the Redshift aggregate window function list given below, specify the frame_clause while using any function from the list:
AVG
COUNT
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAG
LAST_VALUE
LEAD
LISTAGG
MAX
MEDIAN
MIN
NTH_VALUE
PERCENTILE_CONT
PERCENTILE_DISC
RATIO_TO_REPORT
STDDEV_POP
STDDEV_SAMP (synonym for STDDEV)
SUM
VAR_POP
VAR_SAMP (synonym for VARIANCE)
In the Redshift ranking window functions given below, do not specify the frame_clause while using any function from the list:
DENSE_RANK
NTILE
PERCENT_RANK
RANK
ROW_NUMBER
Use frame_clause carefully when using a window function. While It is not very critical for Snowflake, using it incorrectly in Redshift can lead to errors.
Example of using frame_clause:
- entity_var:name:first_num_b_order_num_bselect:first_value(tbl_c.num_b)# Specify frame clause as aggregate window function is usedfrom:inputs/tbl_cdefault:-1where:tbl_c.num_b >= 10window:order_by:- tbl_c.num_b descframe_clause:ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING- entity_var:name:first_num_b_order_num_b_rankselect:rank()# DO NOT specify frame clause as ranking window function is usedwindow:order_by:- first_num_b_order_num_b asc
Note how frame_clause is specified in first entity_var and not in the second one.
Macros
Macros are the reusable functions that encapsulate complex processing logic directly within the SQL expression. You can create macros to perform computations and use them in multiple models. For example:
This site uses cookies to improve your experience while you navigate through the website. Out of
these
cookies, the cookies that are categorized as necessary are stored on your browser as they are as
essential
for the working of basic functionalities of the website. We also use third-party cookies that
help
us
analyze and understand how you use this website. These cookies will be stored in your browser
only
with
your
consent. You also have the option to opt-out of these cookies. But opting out of some of these
cookies
may
have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This
category only includes cookies that ensures basic functionalities and security
features of the website. These cookies do not store any personal information.
This site uses cookies to improve your experience. If you want to
learn more about cookies and why we use them, visit our cookie
policy. We'll assume you're ok with this, but you can opt-out if you wish Cookie Settings.