# Learning `canonical_id` for Cookie ID In this page, we will explain about cases where the same person is identified by stitching together the `td_client_id` and `td_global_id` of multiple sites. Consider the following data example where there are four sites and `td_client_id` and `td_global_id` are available in all tables. | date | | site_aaa | site_aaa | | site_xxx | site_xxx | | site_yyy | site_yyy | | site_zzz | site_zzz | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | month | day | td_client_id | td_global_id | | td_client_id | td_global_id | | td_client_id | td_global_id | | td_client_id | td_global_id | | 1 | 5 | aaa_001 | 3rd_001 | | | | | yyy_001 | 3rd_001 | | | | | | 15 | aaa_001 | 3rd_002 | | | | | | | | zzz_001 | 3rd_002 | | | 25 | aaa_001 | 3rd_003 | | | | | | | | | | | 2 | 5 | aaa_001 | 3rd_004 | | xxx_001 | 3rd_004 | | | | | | | | | 15 | | | | xxx_001 | 3rd_005 | | yyy_002 | 3rd_005 | | | | | | 25 | | | | | | | yyy_002 | 3rd_006 | | zzz_003 | 3rd_006 | | 3 | 5 | | | | | | | | | | zzz_003 | 3rd_007 | | | 15 | | | | xxx_002 | 3rd_008 | | | | | zzz_003 | 3rd_008 | | | 25 | aaa_002 | 3rd_009 | | xxx_002 | 3rd_009 | | | | | | | | 4 | 5 | aaa_002 | 3rd_010 | | | | | yyy_003 | 3rd_010 | | | | | | 15 | | | | | | | yyy_003 | 3rd_011 | | zzz_004 | 3rd_011 | | | 25 | | | | xxx_003 | 3rd_012 | | | | | zzz_004 | 3rd_012 | | 5 | 5 | aaa_003 | 3rd_013 | | xxx_003 | 3rd_013 | | | | | | | | | 15 | aaa_003 | 3rd_014 | | | | | | | | | | | | 25 | aaa_003 | 3rd_015 | | | | | yyy_004 | 3rd_015 | | zzz_005 | 3rd_015 | | 6 | 5 | aaa_003 | 3rd_016 | | xxx_004 | 3rd_016 | | | | | | | | | 15 | | | | xxx_004 | 3rd_017 | | | | | zzz_005 | 3rd_017 | | | 25 | | | | | | | yyy_005 | 3rd_018 | | zzz_005 | 3rd_018 | It may appear that there are multiple users, but in fact these are cases that can be identified to just one person by connecting them with `td_client_id` and td_global_id across the site. ![canonical_id_1](/assets/4-1-1.5a5052076bdfcc173c6faefe53fea26a3184b2036180eea3d97e9a246acf0ce6.22ff5745.png) If ID Unification is performed with this data as input, only one `canonical_id` is expected to be generated. Let's check this in practice. ## Data Preparation We proceed on the assumption that the above data is stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using [ Workflow Samples](https://github.com/treasure-data/treasure-boxes/tree/master/tool-box/id-unification-samples). ### site_aaa | time | site | td_client_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | | 2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | | 2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | | 2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | | 2023/02/15 | aaa.jp | | | | 2023/02/25 | aaa.jp | | | | 2023/03/05 | aaa.jp | | | | 2023/03/15 | aaa.jp | | | | 2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | | 2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | | 2023/04/15 | aaa.jp | | | | 2023/04/25 | aaa.jp | | | | 2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | | 2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | | 2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | | 2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | | 2023/06/15 | aaa.jp | | | | 2023/06/25 | aaa.jp | | | ### site_xxx | time | site | td_client_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | xxx.jp | | | | 2023/01/15 | xxx.jp | | | | 2023/01/25 | xxx.jp | | | | 2023/02/05 | xxx.jp | xxx_001 | 3rd_004 | | 2023/02/15 | xxx.jp | xxx_001 | 3rd_005 | | 2023/02/25 | xxx.jp | | | | 2023/03/05 | xxx.jp | | | | 2023/03/15 | xxx.jp | xxx_002 | 3rd_008 | | 2023/03/25 | xxx.jp | xxx_002 | 3rd_009 | | 2023/04/05 | xxx.jp | | | | 2023/04/15 | xxx.jp | | | | 2023/04/25 | xxx.jp | xxx_003 | 3rd_012 | | 2023/05/05 | xxx.jp | xxx_003 | 3rd_013 | | 2023/05/15 | xxx.jp | | | | 2023/05/25 | xxx.jp | | | | 2023/06/05 | xxx.jp | xxx_004 | 3rd_016 | | 2023/06/15 | xxx.jp | xxx_004 | 3rd_017 | | 2023/06/25 | xxx.jp | | | ### site_yyy | time | site | td_client_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | yyy.jp | yyy_001 | 3rd_001 | | 2023/01/15 | yyy.jp | | | | 2023/01/25 | yyy.jp | | | | 2023/02/05 | yyy.jp | | | | 2023/02/15 | yyy.jp | yyy_002 | 3rd_005 | | 2023/02/25 | yyy.jp | yyy_002 | 3rd_006 | | 2023/03/05 | yyy.jp | | | | 2023/03/15 | yyy.jp | | | | 2023/03/25 | yyy.jp | | | | 2023/04/05 | yyy.jp | yyy_003 | 3rd_010 | | 2023/04/15 | yyy.jp | yyy_003 | 3rd_011 | | 2023/04/25 | yyy.jp | | | | 2023/05/05 | yyy.jp | | | | 2023/05/15 | yyy.jp | | | | 2023/05/25 | yyy.jp | yyy_004 | 3rd_015 | | 2023/06/05 | yyy.jp | | | | 2023/06/15 | yyy.jp | | | | 2023/06/25 | yyy.jp | yyy_005 | 3rd_018 | ### site_zzz | time | site | td_client_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | zzz.jp | | | | 2023/01/15 | zzz.jp | zzz_001 | 3rd_002 | | 2023/01/25 | zzz.jp | | | | 2023/02/05 | zzz.jp | | | | 2023/02/15 | zzz.jp | | | | 2023/02/25 | zzz.jp | zzz_003 | 3rd_006 | | 2023/03/05 | zzz.jp | zzz_003 | 3rd_007 | | 2023/03/15 | zzz.jp | zzz_003 | 3rd_008 | | 2023/03/25 | zzz.jp | | | | 2023/04/05 | zzz.jp | | | | 2023/04/15 | zzz.jp | zzz_004 | 3rd_011 | | 2023/04/25 | zzz.jp | zzz_004 | 3rd_012 | | 2023/05/05 | zzz.jp | | | | 2023/05/15 | zzz.jp | | | | 2023/05/25 | zzz.jp | zzz_005 | 3rd_015 | | 2023/06/05 | zzz.jp | | | | 2023/06/15 | zzz.jp | zzz_005 | 3rd_017 | | 2023/06/25 | zzz.jp | zzz_005 | 3rd_018 | ### id_unification_ex1.dig ```yaml +call_unification: http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call headers: - authorization: ${secret:td.apikey} method: POST retry: true content_format: json content: run_canonical_ids: true run_enrichments: true run_master_tables: true full_refresh: true keep_debug_tables: true unification: !include : unification_ex1.yml ``` ### unification_ex1.yml ```yaml name: test_id_unification_ex1 keys: - name: td_client_id invalid_texts: [''] - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] tables: - database: test_id_unification_ex1 table: site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_xxx key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_yyy key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_zzz key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} canonical_ids: - name: unified_cookie_id merge_by_keys: [td_client_id, td_global_id] merge_iterations: 5 master_tables: - name: master_table_ex1 canonical_id: unified_cookie_id attributes: - name: td_client_id invalid_texts: [''] array_elements: 5 source_columns: - {table: site_aaa, order: first, order_by: td_client_id, priority: 1} - {table: site_xxx, order: first, order_by: td_client_id, priority: 2} - {table: site_yyy, order: first, order_by: td_client_id, priority: 3} - {table: site_zzz, order: first, order_by: td_client_id, priority: 4} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_xxx, order: last, order_by: time, priority: 1} - {table: site_yyy, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 1} ``` ## Explanation of the Unification Algorithm While this algorithm is commonly referred to as the Union-Find Algorithm, it will be referred to as the Unification Algorithm in this document. The Unification Algorithm represents the relationships between keys as a directed graph and transforms the graph into a structure that allows the system to identify groups of keys as representing the same individual through iterative processing. ### Initial State of the Graph The `+extract_and_merge` task in the workflow generates the `graph_unify_loop_0` table. This represents the initial state of the graph, illustrating the "follower -> leader" relationships between keys. ![canonical_id_1](/assets/4-2-1.11627f59edd749d7d10eb6d38d4a8ffe9e1c6f7395d9788b7671199f00949fd7.22ff5745.png) ### graph_unify_loop_0 The initial graph represented by the `graph_unify_loop_0` table is visualized using Graphviz as shown below. ![canonical_id_1](/assets/graph_unify_loop_0.31dcd44919fef1651a95af913c26d5a4d3df2b477ab5bb8c1e95c63ddf062203.22ff5745.png) #### How the graph_unify_loop_0 Table Is Created Let’s examine how the `graph_unify_loop_0` table, which represents the initial graph, is generated. ```yaml canonical_ids: - name: unified_cookie_id merge_by_keys: [td_client_id, td_global_id] merge_iterations: 5 ``` In the `merge_by_keys:` section of `canonical_ids:`, keys used for stitching are listed in order of priority. This priority influences how the graph is constructed. ![canonical_id_1](/assets/4-3-1.32d35e5cdb2b3493c4bf752c2ea2ac21f08c439db68cc495e8f498aaa4ec3270.22ff5745.png) The `graph_unify_loop_0` table is generated from the original data through the following steps: ![canonical_id_1](/assets/4-4-1.5f1ad6d07b3e768a3e617c0a7e2b38abe4bb947c02cf59e9360be0dfe85cd81c.22ff5745.png) 1. Among all key columns, the highest-priority key is designated as the leader. 2. All key columns (including the leader) are expanded into rows as followers paired with their leader. 3. The source table is similarly expanded for all tables. 4. The `graph_unify_loop_0` table is created by performing a `UNION ALL` on all tables and extracting unique pairs of `follower_id`, `follower_ns`, `leader_id`, and `leader_ns`. The `follower_ns` and `leader_ns` specify the key each ID belongs to (in this example, 1 corresponds to `td_client_id`, and 2 to `td_global_id`). The priority of keys in this setup is `[1, 2]`. ### graph_unify_loop_1 Using the graph from `graph_unify_loop_0`, leaders are updated based on the following rules: ![canonical_id_1](/assets/4-5-1.1dbd7b2977bd9e06611c67534ada114ba6044896cc4269b02ed1b09e950a4f15.22ff5745.png) - For each leader, compare its value with all other leaders connected via a single follower. - If a smaller value exists (in terms of string comparison), the leader is replaced by the one with the smaller value. The original leader becomes the old leader and is replaced by the new leader. - An "old_leader -> old_leader" relationship will always exist and must be replaced by "old_leader -> new_leader." This change signifies that the old leader is now a follower of the new leader. The table created after all leaders are updated to their new leaders becomes the `graph_unify_loop_1` table. ![canonical_id_1](/assets/4-6-1.4e835c9ead2923b589f7aed656323be89a58348ff209a5dbb25197396f07d6e0.22ff5745.png) This process results in many leaders being replaced, with old leaders pointing to new leaders as followers. The same process is repeated in subsequent loops. ![canonical_id_1](/assets/graph_unify_loop_1.311b6800465467e137c69e9b6feb01e24292137753afbc8f73e24aaa63faa932.22ff5745.png) ### graph_unify_loop_2 ![canonical_id_1](/assets/graph_unify_loop_2.3b97d86df91568a142043efe023d2b5e6240e3031a72a1e2e7cb1c0ae6a481e3.22ff5745.png) ### graph_unify_loop_3 ![canonical_id_1](/assets/graph_unify_loop_3.49921071bf7ba4d8afedccfd8da2ade994bec2650d5ad2e1da817445b42cdaee.22ff5745.png) ### graph_unify_loop_4 ![canonical_id_1](/assets/graph_unify_loop_4.87f3ab5d1bdf9ba6992ab8f9317cd16672a550720271f827cb8682e33f04d9f0.22ff5745.png) ### graph_unify_loop_5 (graph) In this example, with `merge_iterations:` set to 5, `graph_unify_loop_5` represents the final state of the algorithm. Note that this table is output as `graph` rather than `graph_unify_loop_5`. ### Determining Convergence The Unification Algorithm is guaranteed to converge. In `graph_unify_loop_4`, only one leader remains, and all followers point to it. This is considered a converged state. The convergence conditions are: 1. A single leader is consolidated for each individual. 2. All keys point exclusively to this leader. Once converged, a canonical ID can be assigned to all keys for the same individual. In this example, a minimum of 4 iterations (`merge_iterations: 4`) is necessary to achieve convergence. Note Setting a higher value for `merge_iterations:` is not problematic since processing stops upon convergence. However, note that `graph_unify_loop_N` tables are only created up to the point of convergence. ### Impact of Changing merge_by_keys: Order If the order in `merge_by_keys:` is changed to prioritize `td_global_id` as shown below, the graph evolution will differ completely: ```yaml canonical_ids: - name: unified_cookie_id # merge_by_keys: [td_client_id, td_global_id] merge_by_keys: [td_global_id, td_client_id] merge_iterations: 5 ``` ### graph_unify_loop_0 ![graph_unify_loop_0](/assets/graph_unify_loop_0.b355dc6723e15532fa61b82c24885de1aa1e8818c8f6c943eda3fd284cd40397.6c3d669e.png) ### graph_unify_loop_1 ![graph_unify_loop_1](/assets/graph_unify_loop_1.d0dea7e495a9b4adcf037249e4dc604dd9d3b37cd61753f35c29cd30da0e4ff7.6c3d669e.png) ### graph_unify_loop_2 ![graph_unify_loop_2](/assets/graph_unify_loop_2.e1e9c8bda6929d6cf1cd6bd54209220f13c81e2c33ccba0783a43958aec961e7.6c3d669e.png) ### graph_unify_loop_3 ![graph_unify_loop_3](/assets/graph_unify_loop_3.d8d8f30836d1be4c4a5b2bd5666c56aa70bdcb9ecdbd31e7a485e46e588a1a8a.6c3d669e.png) ### graph_unify_loop_4 ![graph_unify_loop_4](/assets/graph_unify_loop_4.06b70fc07a69fca353ec0731173e9b220999b1b7651535d80aaeca85f6045cf3.6c3d669e.png) ### Is the order good for the least number of loops? You might think that a good `merge_by_keys:` order is one that minimises the number of loops, but this is not true. (To begin with, finding the order of keys that minimises the number of loops is not realistic, as you have to run every combination of the order at least once.) The important thing is not the number of loops, but to set the order so that the `canonical_id` generated for each individual is as invariant as possible (for subsequent workflow updates). ## Generation and Assignment of `canonical_id` The process of assigning `canonical_id` to the final graph after running the Unification Algorithm is executed through the `+canonicalize` task. For each user, a `canonical_id` is generated based on the leader's value, and the same ID is distributed to all its followers. The `canonical_id` is derived based on the value of the leader key (`graph.leader_id`) using the following calculation: ```SQL to_base64url( to_big_endian_64( bitwise_xor( from_big_endian_64( substr( sha256(to_utf8(graph.leader_id)), 1, 8 ) ), leader_keys.key_mask_low64i ) ) || leader_keys.key_mask_high8b ) ``` ![canonical_id_1](/assets/4-7-1.7fe6af8afd570261b9ebd82c29731b36f678afb68bae7e3e639eb8adfb7afc15.22ff5745.png) In the above example, `canonical_id` for User1 and User2 is generated based on the `td_client_id` that became the leader. In contrast, User3's `td_client_id` is absent, so the leader is set to the next prioritized value (`td_global_id`), and the `canonical_id` is generated accordingly. The generated `canonical_id` is also assigned to all connecting followers, ensuring that one leader and all its followers share the same `canonical_id`. The resulting lookup table records this information, allowing you to see the `canonical_id` assigned to each key. ### Is `canonical_id` Immutable? The `canonical_id` is not guaranteed to remain consistent (for the same user) across every run of the Unification Workflow. ![canonical_id_1](/assets/4-8-1.aa0c5b359ff2e91ab905b0d77ffe9142701cf8554bb8a5597b7e2fef85bb0866.22ff5745.png) In the next update, suppose a smaller value leader, `"aaa_00"`, appears for User1. In this case, User1's leader becomes `"aaa_00"` (as the smallest value becomes the sole leader), and a different `canonical_id` is generated. A different `canonical_id` implies a different individual, indicating the emergence of a new user. For example, in Audience Studio, which uses this `canonical_id`, the `cdp_customer_id` for User1 would change. ### Determining the Order of `merge_by_keys` It is crucial to ensure that the `canonical_id` remains as consistent as possible across updates to the Unification Workflow. The order of keys in `merge_by_keys:` has a significant impact on the immutability of `canonical_id`. The Unification Algorithm prioritizes the keys specified in `merge_by_keys:` in order, setting the earliest (highest priority) key as the leader whenever possible. This means the key at the top of the list is often chosen as the leader for each user. Therefore, `merge_by_keys:` should be set with keys ordered from those least likely to change over time to ensure stability. For users who lack the highest-priority key, the second-priority key becomes the leader, making the order of keys critically important. ### Examples of `merge_by_keys:` Settings - Good: `merge_by_keys: [email, td_global_id]` - `email` is less likely to change or increase for individuals, ensuring the same value is selected across updates (and thus a consistent `canonical_id`). - Good: `merge_by_keys: [member_id, email, td_global_id]` - `member_id` or `customer_id` is stable for individuals, ensuring the same value is selected across updates (and thus a consistent `canonical_id`). - Bad: `merge_by_keys: [td_client_id, email, td_global_id]` - `td_client_id` may change over time. If a newly generated value is smaller than the previous minimum, it becomes the leader, resulting in a different `canonical_id` from the previous run. ## ID Unification Output Here, we will closely examine the tables outputted in this example. This will provide more detailed insights than the explanation provided in [Unification Output](/en/tools/idu/p2_output_result), including the meaning of each column in the tables. ### Database for Storing Results A database named `cdp_unification_${name}`, based on the `name:` value defined in `unification_ex1.yml`, is created to store the results. In this example, the database is named `cdp_unification_test_id_unification_ex1`. ```yaml name: test_id_unification_ex1 # Unification Name keys: - name: td_client_id - name: td_global_id tables: - database: test_id_unification_ex1 table: ex1_site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} ``` ![canonical_id_1](/assets/4-9-1.3da941787668a73121f1a3b03aec88f5a08bba6012334dc6c901481a2d4678be.22ff5745.png) ### Tables Utilizing the Results #### master_table In this example, the table is named `master_table_ex1`. ```yaml master_tables: - name: master_table_ex1 canonical_id: unified_cookie_id attributes: - name: td_client_id invalid_texts: [''] array_elements: 5 source_columns: - {table: site_aaa, order: first, order_by: td_client_id, priority: 1} - {table: site_xxx, order: first, order_by: td_client_id, priority: 2} - {table: site_yyy, order: first, order_by: td_client_id, priority: 3} - {table: site_zzz, order: first, order_by: td_client_id, priority: 4} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_xxx, order: last, order_by: time, priority: 1} - {table: site_yyy, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 1} ``` The table specified under `name:` in `master_tables:` is output. This is used as the `master_table` in the Master Segment of the Audience Studio. Below is the result from this example. ##### Example Result | unified_cookie_id | td_client_id | td_global_id | time | | --- | --- | --- | --- | | tRf8bJZ_0EC1 | ["aaa_001", "aaa_001", "aaa_001", "aaa_001", "aaa_002"] | 3rd_018 | 1691675802 | A record is generated for each `canonical_id`. In this case, there is only one record. The `time` column indicates when the table was created. ![canonical_id_1](/assets/4-10-1.ab03bc48624e0c0a6a3e16e56b55fb772ebafcf3ab6a304fc05afb24c2d60510.22ff5745.png) Apart from the `canonical_id`, values specified in `attributes:` are stored. The `td_client_id` column, for example, picks up a maximum of five values based on the table order and record sequence defined in `source_tables:`. However, as shown in this example, duplicate values may occur, so caution is needed. #### enriched_ Table In this example, the tables are named `enriched_site_aaa`, `enriched_site_xxx`, `enriched_site_yyy`, and `enriched_site_zzz`. ```yaml tables: - database: test_id_unification_ex1 table: site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_xxx key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_yyy key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex1 table: site_zzz key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} ``` Tables named `enriched_${source_table_name}` are output (four in this example). These are used in the Master Segment as `attribute_table` or `behavior_table`. ##### Example Result Below is the `enriched_site_aaa` table. | time | site | td_client_id | td_global_id | unified_cookie_id | | --- | --- | --- | --- | --- | | 1672876800 | aaa.jp | aaa_001 | 3rd_001 | tRf8bJZ_0EC1 | | 1673740800 | aaa.jp | aaa_001 | 3rd_002 | tRf8bJZ_0EC1 | | 1674604800 | aaa.jp | aaa_001 | 3rd_003 | tRf8bJZ_0EC1 | | 1675555200 | aaa.jp | aaa_001 | 3rd_004 | tRf8bJZ_0EC1 | | 1676419200 | aaa.jp | | | | | 1677283200 | aaa.jp | | | | | ... | ... | ... | ... | ... | ![canonical_id_1](/assets/4-11-1.42aa664eb617519050d8c37b6fa2e2148e7d844e5c2398928fb22195f3bdf868.22ff5745.png) This translation preserves the structure and technical accuracy of the original content while ensuring clarity in English. Let me know if you need further adjustments! ### Tables to Check If Unification Was Successful #### `source_key_stats` Table In this example, the table is named `unified_cookie_id_source_key_stats`. This table outputs statistics from the source table (e.g., the number of unique keys for each column). Only the `total_distinct` column is calculated from the `unified_cookie_id_graph_unify_loop_0` table. This table appends records with each execution. To retrieve the latest record, execute the following query: ```SQL SELECT * FROM ( SELECT *, RANK() OVER (ORDER BY time DESC) AS id FROM unified_cookie_id_source_key_stats ) WHERE id = 1 ORDER BY from_table ``` ##### Example Result The `id` column was added by the above SQL query. | from_table | total_distinct | distinct_td_client_id | distinct_td_global_id | time | id | | --- | --- | --- | --- | --- | --- | | * | 16 | 16 | 18 | 1691675802 | 1 | | site_aaa | 12 | 3 | 10 | 1691675802 | 1 | | site_xxx | 11 | 4 | 8 | 1691675802 | 1 | | site_yyy | 12 | 5 | 7 | 1691675802 | 1 | | site_zzz | 13 | 4 | 9 | 1691675802 | 1 | ![source_key_stats](/assets/4-12-1.8311032b747b72ba22c50c9441c4c368bafac425e47d1ff97f348b870383d35e.22ff5745.png) #### `result_key_stats` Table In this example, the table is named `unified_cookie_id_result_key_stats`. This table records statistics such as the number of canonical IDs (unique individuals) generated for each table after ID unification. The table appends records with each execution. To retrieve the latest record, execute the following query: ```SQL SELECT * FROM ( SELECT *, RANK() OVER (ORDER BY time DESC) AS id FROM unified_cookie_id_result_key_stats ) WHERE id = 1 ORDER BY from_table ``` ##### Example Results The `id` column was added by the above SQL query. | from_table | total_distinct | distinct_with_td_client_id | distinct_with_td_global_id | histogram_td_client_id | histogram_td_global_id | time | id | | --- | --- | --- | --- | --- | --- | --- | --- | | * | 1 | 1 | 1 | 16:1 | 18:1 | 1691675802 | 1 | | site_aaa | 1 | 1 | 1 | 16:1 | 18:1 | 1691675802 | 1 | | site_xxx | 1 | 1 | 1 | 16:1 | 18:1 | 1691675802 | 1 | | site_yyy | 1 | 1 | 1 | 16:1 | 18:1 | 1691675802 | 1 | | site_zzz | 1 | 1 | 1 | 16:1 | 18:1 | 1691675802 | 1 | ![source_key_stats](/assets/4-13-1.6614aff6539636e63766f65fe3db2dd2f8e1677a16fe0845ad7ef634b8399789.22ff5745.png) #### Column Explanations for `result_key_stats` - **`from_table`**: Indicates the source table for the computed statistics. The `"*"` row represents all tables combined. - **`total_distinct`**: Number of unique canonical IDs in the entire table (if `"*"`) or within each specific table. - **`distinct_with_td_client_id`**: Number of canonical IDs containing at least one `td_client_id`. Excludes IDs not stitched with `td_client_id`. - **`distinct_with_td_global_id`**: Number of canonical IDs containing at least one `td_global_id`. Excludes IDs not stitched with `td_global_id`. - **`histogram_td_client_id`**: Distribution of how many `td_client_ids` were merged into one canonical ID (e.g., `16:1` means 16 `td_client_ids` became one canonical ID). - **`histogram_td_global_id`**: Similar to `histogram_td_client_id` but for `td_global_id`. ### Tables Generated in the Process of the Unification Algorithm The following group of tables is used internally by the workflow. Some are used for tasks such as convergence judgment. #### `graph_unify_loop_N` Table In this example, the table is named `unified_cookie_id_graph_unify_loop_0` (1, 2, 3, 4). This table represents the state of the graph in the Nth loop, frequently referenced in the algorithm's introduction. The last loop is not included here because it has a different name (`graph` Table). ##### Example Result (for `unified_cookie_id_graph_unify_loop_0`) | follower_id | follower_ns | leader_id | leader_ns | follower_first_seen_at | follower_last_seen_at | follower_source_table_ids | follower_last_processed_at | time | | --- | --- | --- | --- | --- | --- | --- | --- | --- | | yyy_002 | 1 | yyy_002 | 1 | 1676419200 | 1677283200 | [7] | 1691675815 | 1691675802 | | 3rd_015 | 2 | zzz_005 | 1 | 1684972800 | 1684972800 | [8, 5, 7] | 1691675815 | 1691675802 | | 3rd_015 | 2 | aaa_003 | 1 | 1684972800 | 1684972800 | [8, 5, 7] | 1691675815 | 1691675802 | | 3rd_015 | 2 | yyy_004 | 1 | 1684972800 | 1684972800 | [8, 5, 7] | 1691675815 | 1691675802 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | #### Column Explanations for `unified_cookie_id_graph_unify_loop_0` - **`follower_id`, `leader_id` Columns**: These columns represent the keys for the follower and leader. - **`follower_ns`, `leader_ns` Columns**: Indicate the namespace for the follower and leader keys. The order of keys in the `merge_by_keys:` configuration determines the values 1, 2, etc. In this example, `td_client_id` is 1, and `td_global_id` is 2. - **`follower_first_seen_at`, `follower_last_seen_at` Columns**: Record the timestamp when the follower's key value first and last appeared (based on the time column) in the source table. - **`follower_source_table_ids` Columns**: An array listing the IDs of source tables containing the follower's key values. Use the `tables` table (`unified_cookie_id_tables` in this example) to determine which source table corresponds to each ID. - **`follower_last_processed_at` Columns**: The timestamp when the follower was last processed in the loop. #### `graph` Table In this example, the table is named `unified_cookie_id_graph`. It represents the graph's state in the final specified loop. ### Tables for Mapping `canonical_id` to Keys #### `lookup` Table In this example, the table is named `unified_cookie_id_lookup`. This table stores all key values in a single row and allows the lookup of a key's `canonical_id`. It is used to create the master and enriched tables. ![source_key_stats](/assets/4-14-1.eb606c2bdffc64cf2f2cfbf39e8833f23149239a381621574a8e947aa3ebb69e.22ff5745.png) ##### Example Result | `canonical_id` | id | id_key_type | canonical_id_first_seen_at | canonical_id_last_seen_at | id_first_seen_at | id_last_seen_at | id_source_table_ids | id_last_processed_at | time | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | tRf8bJZ_0EC1 | 3rd_016 | 2 | 1672876800 | 1687651200 | 1685923200 | 1685923200 | [5, 6] | 1691675863 | 1691675802 | | tRf8bJZ_0EC1 | aaa_001 | 1 | 1672876800 | 1687651200 | 1672876800 | 1675555200 | [5] | 1691675815 | 1691675802 | | tRf8bJZ_0EC1 | yyy_003 | 1 | 1672876800 | 1687651200 | 1680652800 | 1681516800 | [7] | 1691675851 | 1691675802 | | tRf8bJZ_0EC1 | 3rd_010 | 2 | 1672876800 | 1687651200 | 1680652800 | 1680652800 | [7, 5] | 1691675851 | 1691675802 | Columns not detailed below share the same meanings as in the `unified_cookie_id_graph_unify_loop_0` table. #### Column Explanations for `lookup` - **`id_key_type` Column**: Indicates the key type (number) corresponding to the `id`. For mapping between `key_type` and `key_name`, refer to the `keys` table (`unified_cookie_id_keys` in this example). - **`id_source_table_ids` Column**: Lists the source tables containing the `id`. Use the `tables` table (`unified_cookie_id_tables` in this example) to map each `table_id` to a `table_name`. #### `keys` Table (`unified_cookie_id_keys`) This table maps each key to its corresponding `id` (1, 2, ...). It is used to identify the `key_name` from the `id_key_type` column in the `lookup` table. ##### Example Result | key_type | key_name | time | | --- | --- | --- | | 1 | td_client_id | 1691675802 | | 2 | td_global_id | 1691675802 | #### `tables` Table (`unified_cookie_id_tables`) This table maps each source table to an assigned `id`. It is used to identify the `table_name` from the `id_source_table_ids` column in the `lookup` table. ##### Example Result | table_id | table_name | time | | --- | --- | --- | | 5 | site_aaa | 1691675802 | | 6 | site_xxx | 1691675802 | | 7 | site_yyy | 1691675802 | | 8 | site_zzz | 1691675802 |