# Learning `canonical_id` for Cookie ID and Email This page is based on the previous page [Learning `canonical_id` for Cookie ID](/en/tools/idu/p3_canonical_id_learning), so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing. This page deals with a case where the same person is identified by stitching together the `td_client_id`, `td_global_id`, `td_ssc_id` and email of multiple sites. There are four sites, but each table has a different combination of keys that it has. | | | site_aaa | site_aaa | site_aaa | | site_xxx | site_xxx | | site_yyy | site_yyy | | site_zzz | site_zzz | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 日付 | | td_client_id | td_global_id | td_ssc_id | | td_ssc_id | td_global_id | | email | td_ssc_id | | td_client_id | email | | 1月 | 5日 | aaa_001 | 3rd_001 | ssc_001 | | | | | a@ex.com | ssc_001 | | | | | | 15日 | aaa_001 | 3rd_002 | ssc_001 | | | | | | | | zzz_001 | a@ex.com | | | 25日 | aaa_001 | 3rd_003 | ssc_001 | | | | | | | | | | | 2月 | 5日 | aaa_001 | 3rd_004 | ssc_001 | | ssc_001 | 3rd_004 | | | | | | | | | 15日 | | | | | ssc_001 | 3rd_005 | | a@ex.com | ssc_001 | | | | | | 25日 | | | | | | | | a@ex.com | ssc_001 | | zzz_003 | a@ex.com | | 3月 | 5日 | | | | | | | | | | | zzz_003 | a@ex.com | | | 15日 | | | | | ssc_001 | 3rd_008 | | | | | zzz_003 | a@ex.com | | | 25日 | aaa_002 | 3rd_009 | ssc_002 | | ssc_001 | 3rd_009 | | | | | | | | 4月 | 5日 | aaa_002 | 3rd_010 | ssc_002 | | | | | b@ex.com | ssc_003 | | | | | | 15日 | | | | | | | | b@ex.com | ssc_003 | | zzz_004 | b@ex.com | | | 25日 | | | | | ssc_002 | 3rd_010 | | | | | zzz_004 | c@ex.com | | 5月 | 5日 | aaa_003 | 3rd_013 | ssc_003 | | ssc_002 | 3rd_013 | | | | | | | | | 15日 | aaa_003 | 3rd_014 | ssc_003 | | | | | | | | | | | | 25日 | aaa_003 | 3rd_015 | ssc_004 | | | | | c@ex.com | ssc_003 | | zzz_005 | c@ex.com | | 6月 | 5日 | aaa_003 | 3rd_016 | ssc_004 | | ssc_003 | 3rd_016 | | | | | | | | | 15日 | | | | | ssc_003 | 3rd_017 | | | | | zzz_005 | c@ex.com | | | 25日 | | | | | | | | c@ex.com | ssc_004 | | zzz_005 | c@ex.com | ## Data Preparation The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using [Workflow Samples](https://github.com/treasure-data/treasure-boxes/tree/master/tool-box/id-unification-samples). ### site_aaa | time | site | td_client_id | td_global_id | td_ssc_id | | --- | --- | --- | --- | --- | | 2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | ssc_001 | | 2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | ssc_001 | | 2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | ssc_001 | | 2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | ssc_001 | | 2023/02/15 | aaa.jp | | | | | 2023/02/25 | aaa.jp | | | | | 2023/03/05 | aaa.jp | | | | | 2023/03/15 | aaa.jp | | | | | 2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | ssc_002 | | 2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | ssc_002 | | 2023/04/15 | aaa.jp | | | | | 2023/04/25 | aaa.jp | | | | | 2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | ssc_003 | | 2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | ssc_003 | | 2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | ssc_004 | | 2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | ssc_004 | | 2023/06/15 | aaa.jp | | | | | 2023/06/25 | aaa.jp | | | | ### site_xxx | time | site | td_ssc_id | td_global_id | | --- | --- | --- | --- | | 2023/01/05 | xxx.jp | NULL | NULL | | 2023/01/15 | xxx.jp | NULL | NULL | | 2023/01/25 | xxx.jp | NULL | NULL | | 2023/02/05 | xxx.jp | ssc_001 | 3rd_004 | | 2023/02/15 | xxx.jp | ssc_001 | 3rd_005 | | 2023/02/25 | xxx.jp | NULL | NULL | | 2023/03/05 | xxx.jp | NULL | NULL | | 2023/03/15 | xxx.jp | ssc_001 | 3rd_008 | | 2023/03/25 | xxx.jp | ssc_001 | 3rd_009 | | 2023/04/05 | xxx.jp | NULL | NULL | | 2023/04/15 | xxx.jp | NULL | NULL | | 2023/04/25 | xxx.jp | ssc_002 | 3rd_010 | | 2023/05/05 | xxx.jp | ssc_002 | 3rd_013 | | 2023/05/15 | xxx.jp | NULL | NULL | | 2023/05/25 | xxx.jp | NULL | NULL | | 2023/06/05 | xxx.jp | ssc_003 | 3rd_016 | | 2023/06/15 | xxx.jp | ssc_003 | 3rd_017 | | 2023/06/25 | xxx.jp | NULL | NULL | ### site_yyy | time | site | email | td_ssc_id | | --- | --- | --- | --- | | 2023/01/05 | yyy.jp | a@ex.com | ssc_001 | | 2023/01/15 | yyy.jp | NULL | NULL | | 2023/01/25 | yyy.jp | NULL | NULL | | 2023/02/05 | yyy.jp | NULL | NULL | | 2023/02/15 | yyy.jp | a@ex.com | ssc_001 | | 2023/02/25 | yyy.jp | a@ex.com | ssc_001 | | 2023/03/05 | yyy.jp | NULL | NULL | | 2023/03/15 | yyy.jp | NULL | NULL | | 2023/03/25 | yyy.jp | NULL | NULL | | 2023/04/05 | yyy.jp | b@ex.com | ssc_003 | | 2023/04/15 | yyy.jp | b@ex.com | ssc_003 | | 2023/04/25 | yyy.jp | NULL | NULL | | 2023/05/05 | yyy.jp | NULL | NULL | | 2023/05/15 | yyy.jp | NULL | NULL | | 2023/05/25 | yyy.jp | c@ex.com | ssc_003 | | 2023/06/05 | yyy.jp | NULL | NULL | | 2023/06/15 | yyy.jp | NULL | NULL | | 2023/06/25 | yyy.jp | c@ex.com | ssc_004 | ### site_zzz | time | site | td_client_id | email | | --- | --- | --- | --- | | 2023/01/05 | zzz.jp | NULL | NULL | | 2023/01/15 | zzz.jp | zzz_001 | a@ex.com | | 2023/01/25 | zzz.jp | NULL | NULL | | 2023/02/05 | zzz.jp | NULL | NULL | | 2023/02/15 | zzz.jp | NULL | NULL | | 2023/02/25 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/05 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/15 | zzz.jp | zzz_003 | a@ex.com | | 2023/03/25 | zzz.jp | NULL | NULL | | 2023/04/05 | zzz.jp | NULL | NULL | | 2023/04/15 | zzz.jp | zzz_004 | b@ex.com | | 2023/04/25 | zzz.jp | zzz_004 | c@ex.com | | 2023/05/05 | zzz.jp | NULL | NULL | | 2023/05/15 | zzz.jp | NULL | NULL | | 2023/05/25 | zzz.jp | zzz_005 | c@ex.com | | 2023/06/05 | zzz.jp | NULL | NULL | | 2023/06/15 | zzz.jp | zzz_005 | c@ex.com | | 2023/06/25 | zzz.jp | zzz_005 | c@ex.com | ## id_unification_ex2.dig ```yaml +call_unification: http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call headers: - authorization: ${secret:td.apikey} method: POST retry: true content_format: json content: run_canonical_ids: true run_enrichments: true run_master_tables: true full_refresh: true keep_debug_tables: true unification: !include : unification_ex2.yml ``` ## unification_ex2.yml ```yaml name: test_id_unification_ex2 keys: - name: td_client_id - name: td_global_id - name: td_ssc_id - name: email tables: - database: test_id_unification_ex2 table: site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - {column: td_ssc_id, key: td_ssc_id} - database: test_id_unification_ex2 table: site_xxx key_columns: - {column: td_ssc_id, key: td_ssc_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex2 table: site_yyy key_columns: - {column: email, key: email} - {column: td_ssc_id, key: td_ssc_id} - database: test_id_unification_ex2 table: site_zzz key_columns: - {column: td_client_id, key: td_client_id} - {column: email, key: email} canonical_ids: - name: person_id merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id] merge_iterations: 5 master_tables: - name: master_table_ex2 canonical_id: person_id attributes: - name: email source_columns: - {table: site_yyy, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 1} - name: td_ssc_id array_elements: 5 source_columns: - {table: site_xxx, order: last, order_by: time, priority: 1} - {table: site_yyy, order: last, order_by: time, priority: 2} - name: td_client_id array_elements: 5 source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_zzz, order: last, order_by: time, priority: 4} - name: td_global_id array_elements: 5 source_columns: - {table: site_aaa, order: last, order_by: time, priority: 1} - {table: site_xxx, order: last, order_by: time, priority: 2} ``` ## Explanation of the Unification Algorithm ### graph_unify_loop_0 The initial state of the graph is as follows: ![canonical_id_1](/assets/graph_unify_loop_0.267789ca87f816a94d553457fd9e78b15b4ca98bc004e8526ee84e7490bde108.2cc0388e.png) #### How the graph_unify_loop_0 Table Is Created The `graph_unify_loop_0` table is created within the `+extract_and_merge` task of the workflow. Let's examine how the graph is constructed. ```yaml canonical_ids: - name: person_id merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id] merge_iterations: 5 ``` In this configuration of `canonical_ids:`, `merge_by_keys:` prioritizes `email` to ensure the creation of the most universal `canonical_id`, followed by `td_ssc_id`. Using the original data, the `graph_unify_loop_0` table is generated through the same steps as in the previous section. ![canonical_id_1](/assets/5-1-1.9dce016df17741adedf962adf63eec67eb4ebf7f035fb19da291cba3e78b447f.2cc0388e.png) Unlike before, pairs for `site_aaa` containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows: - **Leader:** - `td_ssc_id` - **Followers:** - `td_client_id` - `td_global_id` - `td_ssc_id` As a result, the record expands into three pairs for each follower (highlighted in green in the figure above). `follower_ns` and `leader_ns` identify which key corresponds to each id. (In this example, 1 is `td_client_id`, 2 is `td_global_id`, 3 is `td_ssc_id`, and 4 is `email`.) In this configuration, the priority order of `ns` is [4,3,1,2]. ### graph_unify_loop_1 Based on the graph from loop_0, leaders are updated according to the following rules: ![canonical_id_1](/assets/5-2-1.38bf1721fb4ace7d8d02c182df25ccc92279bc5e4f2477a546dea8f7e5430127.2cc0388e.png) The key difference this time is: - If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower. All leaders are replaced with `new_leader` (unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the `graph_unify_loop_1` table. ![canonical_id_1](/assets/5-3-1.72b0f4e42580b1f3196744bdb00210b56b28be893635e0c48acf8d24809f9335.2cc0388e.png) ![canonical_id_1](/assets/graph_unify_loop_1.466952b15e1b5c846b8fdf706c9602d5b1cf81fe96479204413e5186d4407953.2cc0388e.png) Subsequent loops follow the same process with the graph from the previous iteration. ### graph_unify_loop_2 ![canonical_id_1](/assets/graph_unify_loop_2.b1f0e2d8832bf90566c944fc5caed069e2ad70780967b7681174463ccc219969.2cc0388e.png) ### graph_unify_loop_3 The process converges in the third loop. ![canonical_id_1](/assets/graph_unify_loop_3.19c5e8d3532428512ead275f8740675a85dde3f0fc8f7f869785741ad56036e6.2cc0388e.png) ## Generating and Assigning `canonical_id` In this example, the order of keys specified in `merge_by_keys:` prioritizes `email` as the highest. This is because, unlike other keys, the value of `email` is less likely to change over time. For users without an email, the `td_ssc_id` becomes the final leader, and the `canonical_id` is generated based on this value. ![canonical_id_1](/assets/5-4-1.ed3ab470fcf8b15e7891a3dc95568e0bd709a0adae30e4e93236bac63f6d8ab0.2cc0388e.png) ### `canonical_id` Configurations for `master_table` The `canonical_id` used in the `master_table` is generally created with all keys specified in `merge_by_keys:`. However, `canonical_id` can also be created using a subset of the keys. That said, the `canonical_id` configured in the `master_table` must meet the following condition: - The configuration must ensure that a `canonical_id` is assigned to every record in all source tables. In other words, every source table must have at least one key included in `merge_by_keys:`. For this example (Example 2 data), the keys in each table are as follows: - `site_aaa` - `td_client_id` - `td_global_id` - `td_ssc_id` - `site_xxx` - `td_ssc_id` - `td_global_id` - `site_yyy` - `email` - `td_ssc_id` - `site_zzz` - `td_client_id` - `email` Given this, the `merge_by_keys:` configurations valid for the `master_table` are: - Good: `[email, td_ssc_id, td_global_id, td_client_id]` - Good: `[td_ssc_id, td_global_id, td_client_id]` - Good: `[email, td_global_id]` - Good: `[td_ssc_id, td_client_id]` On the other hand, the following configurations will result in errors: - Bad: `[td_global_id, td_client_id]` - Missing keys from the `site_yyy` table. - Bad: `[email, td_client_id]` - Missing keys from the `site_xxx` table. In such cases, an error like the following will appear: ```yaml 400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou... ``` For example, in the `[td_global_id, td_client_id]` case, `canonical_id` is only assigned to records generated using `td_global_id` and `td_client_id`. Since the `site_yyy` table does not include these keys, it cannot be assigned a `canonical_id`. At least one key from `merge_by_keys:` (used for stitching) must be present to ensure that every record in all tables is assigned a `canonical_id`.