danger

You are viewing documentation for an older version.

Identity Stitching

Step-by-step tutorial on how to stitch together different user identities.

This guide walks through using a Profile Builder (PB) project to create output tables in a warehouse for a custom identity stitching model.

Prerequisites

  • Familiarize yourself with:

    • A basic Profile Builder project by following the Profile Builder CLI steps.
    • Structure of a Profile Builder project and the parameters used in different files.

Output

After running the project, you can view the generated material tables. A sample output containing the results in Snowflake:

[Figure: Generated tables (Snowflake)]
info
Profiles creates a default ID stitcher even if you do not define any specs for one. The default ID stitcher uses all the sources and ID types defined in the inputs.yaml file. When you define the specs, Profiles creates a custom ID stitcher.

Sample project for Custom ID Stitcher

This sample project considers multiple user identifiers in different warehouse tables and ties them together to create a unified user profile. The following sections describe how to define your PB project files:

Project detail

The pb_project.yaml file defines the project details such as name, schema version, connection name and the entities which represent different identifiers.

You can define all the identifiers from different input sources you want to stitch together as a rudder_id (main_id in this example):

# Project name
name: sample_id_stitching
# Project's yaml schema version
schema_version: 49
# Warehouse connection
connection: test
# Allow inputs without timestamps
include_untimed: true
# Folder containing models
model_folders:
  - models
# Entities in this project and their ids.
entities:
  - name: user
    id_stitcher: models/user_id_stitcher # modelRef of custom ID stitcher model
    id_types:
      - main_id # Add main_id to this list only if you have defined main_id_type: main_id in the ID stitcher buildspec.
      - user_id # One of the identifiers from your data source.
      - email
# Library packages can be imported into the project; the project then inherits their properties.
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/profiles-corelib/tag/schema_{{best_schema_version}}"
    # If required, you can extend the package definition, for example for ID types.

Input

The input file (models/inputs.yaml) includes the input table references and the corresponding SQL for the above-mentioned entities:

inputs:
- name: rsIdentifies
  contract: # constraints that a model adheres to
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: timestamp
      - name: user_id
      - name: anonymous_id
      - name: email
  app_defaults:
    table: rudder_events_production.web.identifies # One of the warehouse tables RudderStack generates when processing identify or track events.
    occurred_at_col: timestamp
    ids:
      - select: "user_id" # SQL expression that picks this column from the above table.
        type: user_id
        entity: user # as defined in project file
        to_default_stitcher: true
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
        to_default_stitcher: true
      - select: "lower(email)" # can use sql.
        type: email
        entity: user
        to_default_stitcher: true
- name: rsTracks
  contract:
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: timestamp
      - name: user_id
      - name: anonymous_id
  app_defaults:
    table: rudder_events_production.web.tracks # Another warehouse table that RudderStack maintains when processing track events.
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
        to_default_stitcher: true
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
        to_default_stitcher: true
info
As shown in the above file, the select field accepts SQL expressions (for example, lower(email)), so you can also handle more complex scenarios.
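To illustrate why an expression such as lower(email) is useful, here is a minimal Python sketch of what the ids entries do conceptually: each entry picks one identifier from a row and tags it with a type. This is not how Profiles runs (it generates SQL in the warehouse); the ID_SELECTORS list, the extract_ids helper, and the sample rows are hypothetical names and data used only for illustration.

```python
# Conceptual sketch only: mimics how the `ids` entries in inputs.yaml pick
# identifiers from a row. Profiles itself does this in warehouse SQL.

# Hypothetical stand-ins for the `select` expressions in inputs.yaml.
ID_SELECTORS = [
    {"type": "user_id", "select": lambda row: row.get("user_id")},
    {"type": "anonymous_id", "select": lambda row: row.get("anonymous_id")},
    # Equivalent of `lower(email)`: normalize case before stitching.
    {
        "type": "email",
        "select": lambda row: row["email"].lower() if row.get("email") else None,
    },
]


def extract_ids(row):
    """Return the (id_type, value) pairs found in one row, skipping missing values."""
    ids = []
    for selector in ID_SELECTORS:
        value = selector["select"](row)
        if value is not None:
            ids.append((selector["type"], value))
    return ids


# Made-up sample rows standing in for the identifies table.
rows = [
    {"user_id": "u1", "anonymous_id": "a1", "email": "Jane@Example.com"},
    {"user_id": None, "anonymous_id": "a2", "email": "jane@example.com"},
]
extracted = [extract_ids(row) for row in rows]
```

Because the email is lowercased, both sample rows yield the same email identifier, so a stitcher can link a1 and a2 to the same user. Without the normalization, the two spellings would be treated as different identifiers.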

Model

The Profiles identity stitching model maps and unifies all the identifiers specified in the pb_project.yaml file across different platforms. It tracks each user's journey across all the data sources and stitches the identifiers together under a single rudder_id.
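Conceptually, stitching can be thought of as finding connected components in a graph whose nodes are identifiers and whose edges link identifiers seen together in the same row. The following Python sketch uses a union-find structure to show the idea. It is an illustration under that assumption, not the algorithm Profiles actually runs, and the stitch function, the sample edges, and the choice of cluster representative are all hypothetical.

```python
# Conceptual sketch: identity stitching viewed as connected components.
# Not the Profiles implementation; names and data are illustrative.


def stitch(edges):
    """Group identifiers into clusters with union-find.

    `edges` is an iterable of ((id_type, value), (id_type, value)) pairs
    that were observed together. Returns a dict mapping every identifier
    to the representative of its cluster.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for a, b in edges:
        union(a, b)

    return {node: find(node) for node in list(parent)}


# Made-up edges: each pair was seen together in one input row.
edges = [
    (("user_id", "u1"), ("anonymous_id", "a1")),
    (("anonymous_id", "a1"), ("email", "jane@example.com")),
    (("anonymous_id", "a2"), ("email", "jane@example.com")),
    (("user_id", "u2"), ("anonymous_id", "a3")),
]
clusters = stitch(edges)
```

Here u1, a1, a2, and the email end up in one cluster, while u2 and a3 form a second one. In this sketch the cluster representative plays the role that rudder_id (or main_id) plays in the stitcher output; how Profiles derives the actual ID value is not covered here.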

A sample profiles.yaml file specifying an identity stitching model (user_id_stitcher) with relevant inputs:

models:
  - name: user_id_stitcher
    model_type: id_stitcher
    model_spec:
      validity_time: 24h
      entity_key: user
      materialization:
        run_type: incremental
      incremental_timedelta: 12h
      main_id_type: main_id
      edge_sources:
        - from: inputs/rsIdentifies
        - from: inputs/rsTracks
Model specification fields
| Field | Data type | Description |
| --- | --- | --- |
| validity_time | Time | Specifies how long the model remains valid with respect to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h is still considered potentially valid and usable for any run requests that do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC. This also determines the validity of the generated feature table; once the validity expires, scheduling takes care of generating new tables. Example values: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days. |
| entity_key | String | Specifies the relevant entity from your inputs.yaml file. For example, here it should be set to user. |
| materialization | List | Add the key run_type: incremental to run the project in incremental mode. This mode considers row inserts and updates from the edge_sources input, which are inferred by checking the timestamp column for the next run. You can provide a buffer time to account for any lag in data in the warehouse for the next incremental run, for example if new rows are added while a run is in progress. If you do not specify this key, it defaults to run_type: discrete. |
| incremental_timedelta | List | (Optional) If the materialization key is set to run_type: incremental, this field sets how far back data should be fetched prior to the previous material for a model (to handle data lag, for example). The default value is 4 days. |
| main_id_type | ProjectRef | (Optional) ID type reserved for the output of the identity stitching model, often set to main_id. It must not be used in any of the inputs and must be listed as an ID type for the entity being stitched. If you do not set it, it defaults to rudder_id. Do not add this key unless it is explicitly required, for example if you want your identity stitcher table's main_id column to be called main_id. |
| edge_sources | List | Specifies the inputs for the identity stitching model, as defined in the inputs.yaml file. |
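The time-based fields can be easier to reason about with a small worked example. The Python sketch below parses duration strings of the form shown above (24h, 30m, 3d), checks whether a requested timestamp falls inside a material's validity window, and computes how far back an incremental run would read. The helper names are hypothetical, the window is assumed to be inclusive at both ends, and only the three unit suffixes mentioned in the table are handled; Profiles may accept other formats.

```python
# Conceptual sketch of validity_time and incremental_timedelta arithmetic.
# Helper names and boundary handling are assumptions, not Profiles behavior.
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(spec):
    """Parse strings like '24h', '30m', or '3d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", spec.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {spec!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_material_valid(material_ts, requested_ts, validity_time):
    """True if requested_ts lies within the material's validity window."""
    return material_ts <= requested_ts <= material_ts + parse_duration(validity_time)


def incremental_window_start(previous_material_ts, incremental_timedelta):
    """Earliest timestamp re-read by the next incremental run."""
    return previous_material_ts - parse_duration(incremental_timedelta)


# The nightly-job example from the validity_time description.
nightly = datetime(2009, 10, 23, tzinfo=timezone.utc)
same_day = datetime(2009, 10, 23, 18, tzinfo=timezone.utc)
next_morning = datetime(2009, 10, 24, 6, tzinfo=timezone.utc)

valid_same_day = is_material_valid(nightly, same_day, "24h")
valid_next_morning = is_material_valid(nightly, next_morning, "24h")
lookback_start = incremental_window_start(nightly, "12h")
```

With validity_time: 24h, the request at 18:00 on 2009-10-23 can reuse the nightly material, while the request on the morning of 2009-10-24 cannot. With incremental_timedelta: 12h, as in the sample model, the next incremental run in this sketch would re-read data starting 12 hours before the previous material.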
