Get a high-level aggregate analysis of your ID graph across entities and a deep-dive analysis of individual IDs.
Profiles Copilot is an added feature embedded within the PB core CLI tool and helps you analyze the health of your ID graph.
Profiles Copilot is only available in pb version 0.19.0.
This feature provides the following insights:
A high-level aggregate analysis of your overall ID graph across your entities.
A drilled-down analysis and visualization of a single customer’s individual graph, for any customer you select.
An optional interactive LLM session where you can ask questions about your ID graph and get answers to them.
Prerequisites
Profiles v0.19.0 installed locally.
profiles-mlcorelib library (v0.6.0 or above) installed in your Python environment (v3.9.0 to v3.11.10) using pip install profiles-mlcorelib. Note that it should be the same Python environment as the profiles-rudderstack library.
Generate a Personal Access Token from the RudderStack dashboard. Then, add it to your site configuration under the rudderstack_access_token key (similar to the connections key).
A successful Profiles project run on the same pb version at least once that has generated an ID graph. Also, the ID graph definition should not have changed after the run.
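The Python-side prerequisites above can be checked with a short script before running the audit. This is a minimal sketch, not part of the pb tool: the helper names (parse_version, python_supported, mlcorelib_supported) are hypothetical, and the version bounds are copied from the prerequisites list. It does not check the pb version itself.

```python
import sys
from importlib import metadata
from itertools import takewhile


def parse_version(text):
    """Turn '0.6.0' into (0, 6, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '0rc1': keep the 0, ignore the rest
            break
    while len(parts) < 3:  # pad so '0.6' compares like '0.6.0'
        parts.append(0)
    return tuple(parts)


def python_supported(version_info=None):
    """True if the interpreter is in the documented 3.9.0 to 3.11.10 range."""
    current = tuple((version_info or sys.version_info)[:3])
    return (3, 9, 0) <= current <= (3, 11, 10)


def mlcorelib_supported(installed=None):
    """True if profiles-mlcorelib is installed at v0.6.0 or above."""
    if installed is None:
        try:
            installed = metadata.version("profiles-mlcorelib")
        except metadata.PackageNotFoundError:
            return False
    return parse_version(installed) >= (0, 6, 0)


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("profiles-mlcorelib OK:", mlcorelib_supported())
```

Run it with the same interpreter that has profiles-rudderstack installed, since both libraries need to share one environment.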
Workflow overview
To use this feature, go to your Profiles project configuration and run the below command within your terminal:
pb audit id_stitcher
The audit tool then:
Scans the project to look for ID stitcher models.
Asks you to select one model if the project has multiple ID stitcher models.
For the selected model, presents three audit stages: an ID graph summary, a cluster analysis, and an optional interactive LLM session.
The pb audit id_stitcher command gives you a detailed report on the ID stitcher as an output. It contains the ID graph summary that helps you understand:
ID graph’s definition, like ID types, main ID, etc.
Count of values, like:
Counts of each ID type before and after ID stitching.
Most widely connected IDs.
Resolved main ID stats on the number of IDs connected together, like average, 99th percentile, etc.
IDs that are never connected to any other ID.
Note that for all the counts, the tool shows some samples but they are at an aggregate level, not on an individual main_id level.
Sample report (the exact text and format may differ):
main id for entity user: USER_MAIN_ID
Analyzing ID Stitcher for entity: user
Node types: ['id', 'anonymous_id', 'contact_id', 'email', 'lead_id', 'user_id']
Unique IDs of each type and their counts:
    id: 10958
    anonymous_id: 4294899
    contact_id: 79854
    email: 122661
    lead_id: 75482
    user_id: 296346
Total Distinct IDs
    Before stitching: 4880200
    After stitching: 4290186
Top 10 nodes by edge count:
    ID: alex@example.com, Type: email, Edges: 2503
    ID: alex@example.com, Type: user_id, Edges: 2140
    ID: alice@example.com, Type: email, Edges: 1725
    ID: abc@example.com, Type: email, Edges: 946
Top 5 nodes of different types, before id stitching, by edge counts:
type id by edge count:
    ID: a2ab0609d6c10187ff11, Edges: 2
    ID: a2ab060a860ee7b512, Edges: 2
    ID: a2ab0605e232d3c609, Edges: 2
    ID: a2ab060bb95dcaef14, Edges: 2
    ID: a2ab0600d159d259, Edges: 2
type anonymous_id by edge count:
    ID: 49a8bebe-4d9d-47e5-8ea3-29153fc33dcb, Edges: 71
    ID: d81d2d84-71a8-46db-9d30-81502051c214, Edges: 61
    ID: 3421a313-189e-45d2-9270-e0ae3181e07d, Edges: 47
type contact_id by edge count:
    ID: 003QP000001SRUjYAO, Edges: 7
    ID: 003QP000001LpBGYA0, Edges: 6
type email by edge count:
    ID: alex@example.com, Edges: 2503
    ID: alice@example.com, Edges: 1725
type lead_id by edge count:
    ID: 00Q5x00001zW1Y9EAK, Edges: 3
    ID: 00QQP00000800rR2AQ, Edges: 3
type user_id by edge count:
    ID: alex@example.com, Edges: 2140
    ID: 28HhwSx1WEU56QM5sY7r74MEf9n, Edges: 1526
Average edge count per node (before stitching): 1.331919
Average edge count by node type (before stitching):
    id: 2.518149
    anonymous_id: 3.122989
    contact_id: 3.068089
    email: 3.406679
    lead_id: 2.453102
    user_id: 4.522037
POST ID STITCHING ANALYSIS
Cluster size after stitching:
    No: of main ids with a single other id (Singleton nodes): 3998772
    Average other id counts: 1.137526
    Min other id counts for a single main id: 1
    Max other id counts for a single main id: 13711
    25th percentile: 1
    50th percentile: 1
    75th percentile: 1
    90th percentile: 1
    99th percentile: 4
Average number of ids of different id types, per main id, after stitching:
    id: 1.059563
    anonymous_id: 1.026212
    contact_id: 1.031319
    email: 1.049362
    lead_id: 1.104071
    user_id: 1.149782
Top 5 biggest clusters after id stitching (and the distinct id types in each cluster):
    Main ID: rid0e06a33cfdbccc7679e635e78f0f391f, Size: 13711
        anonymous_id: 11503
        user_id: 1205
        email: 457
        lead_id: 375
        contact_id: 117
        id: 54
    Main ID: rid039ea1d83a8484fa66f153143f013ac7, Size: 1526
        anonymous_id: 1525
        user_id: 1
    Main ID: rid25ddb86517babaa3f45dceef3c400330, Size: 964
        anonymous_id: 949
        user_id: 9
        email: 2
        contact_id: 2
        lead_id: 2
    Main ID: ridf0fef860592708f92f211df285bef625, Size: 833
        anonymous_id: 618
        user_id: 88
        email: 47
        lead_id: 47
        contact_id: 32
        id: 1
    Main ID: rid2e9139338ef8ccfec085e259da0ef3fb, Size: 719
        anonymous_id: 718
        user_id: 1
Singleton Node Analysis (after stitching):
    anonymous_id: 3963171 nodes (92.28%) not connected to any other ID type
    user_id: 28686 nodes (9.68%) not connected to any other ID type
    contact_id: 5376 nodes (6.73%) not connected to any other ID type
    lead_id: 1374 nodes (1.82%) not connected to any other ID type
    id: 165 nodes (1.51%) not connected to any other ID type
Checking for potential issues:
WARNING:__main__:Potential overstitching detected: Node alex@example.com of type email has 2503 edges
Checking for missing direct edges between node types (but indirect edges can exist):
WARNING:__main__:Warning: No direct edges found between id and anonymous_id
WARNING:__main__:Warning: No direct edges found between id and contact_id
WARNING:__main__:Warning: No direct edges found between id and lead_id
WARNING:__main__:Warning: No direct edges found between id and user_id
WARNING:__main__:Warning: No direct edges found between anonymous_id and id
WARNING:__main__:Warning: No direct edges found between anonymous_id and contact_id
WARNING:__main__:Warning: No direct edges found between anonymous_id and lead_id
WARNING:__main__:Warning: No direct edges found between contact_id and id
WARNING:__main__:Warning: No direct edges found between contact_id and anonymous_id
WARNING:__main__:Warning: No direct edges found between contact_id and user_id
WARNING:__main__:Warning: No direct edges found between lead_id and id
WARNING:__main__:Warning: No direct edges found between lead_id and anonymous_id
WARNING:__main__:Warning: No direct edges found between lead_id and user_id
WARNING:__main__:Warning: No direct edges found between user_id and id
WARNING:__main__:Warning: No direct edges found between user_id and contact_id
WARNING:__main__:Warning: No direct edges found between user_id and lead_id
Check for missing edges between node types (direct or indirect) after id stitching:
No missing edges found between node types. GREAT!!
ANALYSIS COMPLETE FOR ENTITY: user
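To make the post-stitching numbers easier to interpret, the sketch below shows one way cluster sizes and their percentiles could be derived from a list of ID-to-ID edges. It is an illustration only, not the audit tool's implementation: the edge list is hypothetical, clusters are found with a plain union-find, and percentiles use the nearest-rank method, which may differ from what the tool uses.

```python
import math
from collections import Counter


def find(parent, node):
    """Return the cluster representative of node, compressing the path."""
    root = node
    while parent[root] != root:
        root = parent[root]
    while parent[node] != root:
        parent[node], node = root, parent[node]
    return root


def cluster_sizes(edges):
    """Group IDs joined by edges and return each cluster's size, ascending."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    return sorted(Counter(find(parent, n) for n in parent).values())


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical edges; a self-edge stands for an ID never linked to another ID.
edges = [
    ("anon-1", "user-1"),
    ("user-1", "a@example.com"),
    ("anon-2", "user-1"),
    ("anon-3", "anon-3"),
    ("anon-4", "user-2"),
]

sizes = cluster_sizes(edges)
print("Cluster sizes:", sizes)                    # [1, 2, 4]
print("Singleton clusters:", sizes.count(1))      # 1
print("Average size:", round(sum(sizes) / len(sizes), 2))
print("50th percentile:", percentile(sizes, 50))  # 2
print("99th percentile:", percentile(sizes, 99))  # 4
```

A large gap between the 99th percentile and the maximum, as in the sample report (4 versus 13711), is the kind of pattern that points at a small number of overstitched clusters.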
Cluster analysis
The pb audit id_stitcher command also lets you understand specific clusters (main_ids), for example, how an email ID E1 is linked to a Salesforce ID S1.
The audit tool gives a snapshot of the complete cluster and also points to the IDs that may be problematic.
Sample cluster analysis:
You can explore specific clusters by entering an ID to see how the other ids are all connected and the cluster is formed.
The ID can be either the main ID or any other ID type.
Enter an ID to visualize (or 'skip' to skip this step): Acme
Your network visualization is ready! We've saved an interactive map of your data connections here:
/Users/user_name/git_repos/profiles-prod/graph_outputs/Acme_graph.html
You can open this file in your web browser to explore the network visually.
Cluster Summary:
Network overview:
-----------------
1. Size: 7 ids connected by 9 links
2. Connectivity: On average, each id is connected to 2.6 other ids
3. Spread: The farthest connected ids are 2 steps apart
Key Players:
------------
Most connected ids:
- ID: acme.com, No: of edges: 6
- ID: Acme, No: of edges: 2
- ID: 000123XYZ, No: of edges: 2
- ID: Acme America, No: of edges: 2
- ID: 000456ABC, No: of edges: 2
- ID: 000789DEF, No: of edges: 2
- ID: Acme Inc, No: of edges: 2
Critical Linking Nodes, and percent of id pairs they are connecting (these are the nodes that act as bridges across all nodes):
- ID: acme.com, % of shortest paths: 80.00%
The audit tool also outputs an HTML file that helps you understand the cluster visually. The file is stored locally, and the tool highlights its path in the output so that you can revisit it later.
In the above example, you can see how acme.com joins Acme Inc and Acme America entities which are probably different. You can address this by removing the acme.com domain.
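The figures in the cluster summary (size, connectivity, spread, and the share of shortest paths through a bridging ID) can be reproduced on a small graph. The sketch below is an assumption-laden illustration rather than the tool's own code: the sample output gives only each ID's edge count, so the three pairings between names and record IDs are hypothetical, chosen to match those counts. The bridge share is computed as the fraction of shortest paths between all other ID pairs that pass through the node.

```python
from collections import deque
from itertools import combinations

# acme.com links to every other ID; the three pairings below are hypothetical.
others = ["Acme", "000123XYZ", "Acme America", "000456ABC", "Acme Inc", "000789DEF"]
edges = [("acme.com", other) for other in others] + [
    ("Acme", "000123XYZ"),
    ("Acme America", "000456ABC"),
    ("Acme Inc", "000789DEF"),
]


def build_graph(edge_list):
    """Undirected adjacency sets keyed by ID."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_counts(graph, source):
    """BFS from source: distance to, and number of shortest paths to, each ID."""
    dist, paths = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                paths[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                paths[nxt] += paths[node]
    return dist, paths


def bridge_share(graph, node):
    """Fraction of shortest paths between other ID pairs that pass through node.

    Assumes the cluster is connected, which holds for a stitched cluster.
    """
    info = {n: shortest_path_counts(graph, n) for n in graph}
    dist_v, paths_v = info[node]
    pairs = list(combinations([n for n in graph if n != node], 2))
    total = 0.0
    for s, t in pairs:
        dist_s, paths_s = info[s]
        if dist_s[node] + dist_v[t] == dist_s[t]:
            total += paths_s[node] * paths_v[t] / paths_s[t]
    return total / len(pairs)


graph = build_graph(edges)
avg_degree = 2 * len(edges) / len(graph)
diameter = max(max(shortest_path_counts(graph, n)[0].values()) for n in graph)
share = bridge_share(graph, "acme.com")

print(f"Size: {len(graph)} ids connected by {len(edges)} links")
print(f"Connectivity: {avg_degree:.1f} other ids on average")
print(f"Spread: {diameter} steps")
print(f"acme.com, % of shortest paths: {share:.2%}")
```

With these edges, 12 of the 15 pairs of other IDs can only reach each other through acme.com, which gives the 80.00% in the sample. Removing that one node splits the cluster into three separate pairs.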
Interactive LLM session
The audit tool also provides an open-ended text2sql session where you can ask arbitrary questions related to the ID graph. The question goes to an LLM which prepares a SQL query.
The SQL query then runs on your data warehouse and generates an answer.
Sample session:
Question: how many accounts do we have with more than one domain?
Thought: Ran following queries to get the answer:
SELECT COUNT(*) AS accounts_with_multiple_domains
FROM (
    SELECT ACCOUNT_MAIN_ID
    FROM account_id_graph
    WHERE other_id_type = 'domain'
    GROUP BY ACCOUNT_MAIN_ID
    HAVING COUNT(DISTINCT other_id) > 1
) subquery;;
SELECT COUNT(DISTINCT ACCOUNT_MAIN_ID) AS total_accounts_with_domains
FROM account_id_graph
WHERE other_id_type = 'domain';;
Response: Now we have more context to provide a comprehensive answer:
There are 100 accounts that have more than one domain associated with them. This is out of a total of 25,000 accounts that have at least one domain.
To put this into perspective:
- Approximately 0.50% of accounts with domains (100 out of 25,000) have multiple domains associated with them.
- The vast majority (about 99.75%) of accounts with domains are associated with just a single domain.
This distribution suggests that while most accounts in the system are linked to a single domain, there is a small but significant number of accounts that operate across multiple domains. These could represent larger organizations, companies with multiple brands, or entities that have expanded their online presence across different domain names.
You can ask next question, type exit to quit, or back to select a different entity
Q: what are these accounts? why do they have multiple domains?
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,987: Getting response for prompt:what are these accounts? why do they have multiple domains?
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,988:Converted history to prompt
INFO:api_server.sessions.chat_session@2024-09-30 11:02:28,988:Making llm calls with the extended prompt
Response: Based on the query results, we can answer the user's question:
These are the top 10 accounts with multiple domains, ordered by the number of domains they have:
1. An account (ACCOUNT_MAIN_ID:rid0a615df48d4ae923f518ab3d052ded1f) with 8 domains, including <domain1>, <domain2>, <domain3>, and others.
2. Apex (ACCOUNT_MAIN_ID:rid7492d64d982d1e6b8531e12a39513d6e) with 4 domains across different countries.
3. Aurora (ACCOUNT_MAIN_ID:rid149ba1fd2a2c59b30663fb22d4aabfaa) with 4 domains in different countries.
4. Acme (ACCOUNT_MAIN_ID:rid49a8de71ef3cbe73d0359965e000a20b) with 3 domains in different Nordic countries.
5. A company (ACCOUNT_MAIN_ID:rid0a256a02fc2aaf3bf812db4df78b95fa) with 3 domains.
6. Clement (ACCOUNT_MAIN_ID:rid44cb9148b8c8b5b608d5a8ecebc11e4b) with 3 related domains.
7. Hero (ACCOUNT_MAIN_ID:rid301a9453dddc0bca3a7315af38370c3e) with 3 domains in different countries.
8. Buffalo (ACCOUNT_MAIN_ID:rid0c22fb973e9b75dd2356072f29c4f91e) with 3 related domains.
9. Majority (ACCOUNT_MAIN_ID:rid5d63f8165e5fa418cef359f0dcde888a) with 2 domains.
10. Super (ACCOUNT_MAIN_ID:rid21bb08ad7fd688b23e43d00b0b54d3c0) with 2 domains.
As for why these accounts have multiple domains, we can infer several reasons based on the patterns observed:
1. International presence: Companies like Apex, Aurora, Acme, and Buffalo have domains for different countries or regions.
2. Brand diversification: The first account with 8 domains seems to represent multiple brands or services under one parent company.
...response continued...
These multiple domains allow companies to tailor their online presence to different markets, products, or strategies while maintaining a unified backend infrastructure for user identification and data management.
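To see what the two generated queries in the sample session compute, you can try them on a toy table. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of a data warehouse. The table and column names follow the sample session, while the rows, main IDs, and the salesforce_id type are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_id_graph ("
    "ACCOUNT_MAIN_ID TEXT, other_id TEXT, other_id_type TEXT)"
)

# Hypothetical rows: one row per (main ID, stitched ID) pair.
rows = [
    ("rid-a", "acme.example", "domain"),
    ("rid-a", "acme-nordic.example", "domain"),
    ("rid-a", "000123XYZ", "salesforce_id"),
    ("rid-b", "apex.example", "domain"),
    ("rid-c", "aurora.example", "domain"),
    ("rid-c", "aurora.example", "domain"),  # duplicate, counted once by DISTINCT
    ("rid-d", "000456ABC", "salesforce_id"),
]
conn.executemany("INSERT INTO account_id_graph VALUES (?, ?, ?)", rows)

# Accounts stitched to more than one distinct domain.
multi = conn.execute(
    """
    SELECT COUNT(*) AS accounts_with_multiple_domains
    FROM (
        SELECT ACCOUNT_MAIN_ID
        FROM account_id_graph
        WHERE other_id_type = 'domain'
        GROUP BY ACCOUNT_MAIN_ID
        HAVING COUNT(DISTINCT other_id) > 1
    ) subquery
    """
).fetchone()[0]

# Accounts stitched to at least one domain.
total = conn.execute(
    """
    SELECT COUNT(DISTINCT ACCOUNT_MAIN_ID) AS total_accounts_with_domains
    FROM account_id_graph
    WHERE other_id_type = 'domain'
    """
).fetchone()[0]

print(f"{multi} of {total} accounts with domains have multiple domains")
conn.close()
```

Running the generated SQL yourself is also a quick way to cross-check any percentages the LLM quotes in its written response against the raw counts.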
Privacy policy
For an interactive LLM session, RudderStack sends the user prompt and some data (for example, table and column names, SQL query response, etc.) to an external LLM service.
RudderStack currently uses two LLM service providers: OpenAI and Anthropic.
The privacy policies of both providers state that data sent through their APIs is retained for 30 days to monitor for policy violations.
The data is not used for training purposes and is deleted after 30 days.
The tool asks for your consent to use these LLMs before making the first LLM call. You can opt out of the LLM services if you have concerns about sending your data to external providers.
If you opt out of the LLM services, the audit tool never makes any calls to the LLM provider.
Any consent you provide (opt-in or opt-out) is stored locally in the ~/.pb/preferences.yaml file on the same device and gets reused in all the future profiles audit runs.
To reset your consent, manually delete the consent from the ~/.pb/preferences.yaml file.
Tracking
Note that for every command run, the tool tracks the audit_start and audit_end events.
It creates an anonymous_id in the ~/.pb/preferences.yaml file and uses this ID for future runs. All the runs get associated with this ID. On the very first run when anonymous_id is created, you will see the following disclaimer:
Privacy Policy regarding Telemetry:
- This tool collects usage statistics to help us improve the product.
- We do not collect or store any sensitive information from your models.
- You can opt out by adding 'telemetry_opted_out: true' to ~/.pb/preferences.yaml
The audit tool also collects the IP address of the device as a part of the standard RudderStack call. However, it does not collect any other user identifier.
You can add a line telemetry_opted_out: true in this preference file to opt out of tracking.
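If you prefer to script the opt-out, the following sketch adds the line to a preferences file without duplicating it. It is an illustration under stated assumptions: the helper name opt_out_of_telemetry is hypothetical, the key is assumed to sit at the top level of the YAML file (as the disclaimer text suggests), and the demonstration runs against a temporary file with a made-up anonymous_id rather than your real ~/.pb/preferences.yaml.

```python
import tempfile
from pathlib import Path

OPT_OUT_LINE = "telemetry_opted_out: true"


def opt_out_of_telemetry(preferences_path):
    """Ensure the preferences file contains exactly one opt-out line."""
    path = Path(preferences_path).expanduser()
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop any existing telemetry_opted_out entry (assumed top-level key).
    kept = [ln for ln in lines if not ln.strip().startswith("telemetry_opted_out:")]
    kept.append(OPT_OUT_LINE)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(kept) + "\n")
    return path


# Demonstration on a throwaway file; the anonymous_id value is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "preferences.yaml"
    demo_path.write_text("anonymous_id: 1234-abcd\n")
    opt_out_of_telemetry(demo_path)
    opt_out_of_telemetry(demo_path)  # a second run does not duplicate the line
    demo_text = demo_path.read_text()

print(demo_text)
```

To apply it to your own setup, call opt_out_of_telemetry("~/.pb/preferences.yaml"). Editing the file by hand achieves the same result.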