# Multiple Canonical IDs Up until the previous page, only one `canonical_id` was created by setting all the keys specified in `merge_by_keys:`. In this page, we introduce how to configure multiple `canonical_id`s. We will use the data from [Learning canonical_id for Cookie ID](/en/tools/idu/p3_canonical_id_learning). ## Case1: Generating Multiple Canonical IDs Independently By specifying multiple `canonical_id`s in `canonical_ids:`, it is possible to generate multiple independent `canonical_id`s simultaneously. However, only one `canonical_id` can be set as the Parent Segment, which is generally the one configured with all keys in `merge_by_keys:`. ```yaml name: test_id_unification_ex3 keys: - name: td_client_id invalid_texts: [''] - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] tables: - database: test_id_unification_ex3 table: site_aaa key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex3 table: site_xxx key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex3 table: site_yyy key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} - database: test_id_unification_ex3 table: site_zzz key_columns: - {column: td_client_id, key: td_client_id} - {column: td_global_id, key: td_global_id} canonical_ids: - name: brand_id_ax merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_aaa', 'site_xxx'] merge_iterations: 5 - name: brand_id_yz merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_yyy', 'site_zzz'] merge_iterations: 5 - name: unified_brand_id merge_by_keys: [td_client_id, td_global_id] merge_iterations: 5 master_tables: - name: master_table_ex3 canonical_id: unified_brand_id attributes: - name: td_client_id invalid_texts: [''] source_columns: - {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1} - {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1} - {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1} - {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: 'site_aaa', order: last, order_by: time, priority: 1} - {table: 'site_xxx', order: last, order_by: time, priority: 1} - {table: 'site_yyy', order: last, order_by: time, priority: 1} - {table: 'site_zzz', order: last, order_by: time, priority: 1} ``` `brand_id_ax` is created by performing unification on data from only two sites of the same brand, with the goal of assigning a `canonical_id` specific to this brand. Similarly, `brand_id_yz` is configured for another brand. Additionally, a `unified_brand_id` spanning all brands is created and set for the `master_table`. In this configuration, the following three `canonical_id`s will execute their unification algorithms in parallel: - `brand_id_ax` - `brand_id_yz` - `unified_brand_id` Below are the final graphs for each unification algorithm. ### brand_id_ax_graph ![canonical_id_1](/assets/brand_id_ax_graph.e62af58fe1b93aa1dd8021ac723b64ea300c19ff10a957e62c9f174e731aaa90.1bb9a569.png) ### brand_id_yz_graph ![canonical_id_1](/assets/brand_id_yz_graph.2c175874c18443f6e8be9e707fe78bb7c3583c1726273c02bf3abfb915a79d30.1bb9a569.png) ### unified_brand_id_graph ![canonical_id_1](/assets/unified_brand_id_graph.4cd799a8089317c224d66028f75eca9bb3880e4083474c5773b4da49f6c7ff1e.1bb9a569.png) ### About the Configuration of `source_tables:` In this example, `source_tables:` is specified as follows in the `canonical_ids:` configuration: ```yaml canonical_ids: - name: brand_id_ax merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_aaa', 'site_xxx'] merge_iterations: 5 - name: brand_id_yz merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_yyy', 'site_zzz'] merge_iterations: 5 - name: unified_brand_id merge_by_keys: [td_client_id, td_global_id] merge_iterations: 5 ``` In this setup, the unification algorithm uses only the keys from the tables specified in `source_tables:`. This configuration can be useful when you want to generate `canonical_id`s for specific brands or games in addition to `canonical_id`s spanning all data sources. ## Case2. Creating Master Tables for Each `canonical_id` Next, we will introduce how to create a `master_table` for each `canonical_id`. Multiple `master_table` entries can be enumerated as shown below: ```yaml master_tables: - name: master_table_ex3 canonical_id: unified_brand_id attributes: - name: td_client_id invalid_texts: [''] source_columns: - {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1} - {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1} - {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1} - {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1} - {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1} - {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1} - {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1} - name: master_table_brand_id_ax canonical_id: brand_id_ax attributes: - name: td_client_id invalid_texts: [''] source_columns: - {table: '${td.tbl_aaa}', order: first, order_by: td_client_id, priority: 1} - {table: '${td.tbl_xxx}', order: first, order_by: td_client_id, priority: 1} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: '${td.tbl_aaa}', order: last, order_by: time, priority: 1} - {table: '${td.tbl_xxx}', order: last, order_by: time, priority: 1} - name: master_table_brand_id_yz canonical_id: brand_id_yz attributes: - name: td_client_id invalid_texts: [''] source_columns: - {table: '${td.tbl_yyy}', order: first, order_by: td_client_id, priority: 1} - {table: '${td.tbl_zzz}', order: first, order_by: td_client_id, priority: 1} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: '${td.tbl_yyy}', order: last, order_by: time, priority: 1} - {table: '${td.tbl_zzz}', order: last, order_by: time, priority: 1} ``` ### `master_table` The resulting `master_table` for each case would look like this: #### `master_table_ex3` | unified_brand_id | td_client_id | td_global_id | | --- | --- | --- | | jiioKQC2uMFH | aaa_001 | 3rd_018 | #### `master_table_brand_id_ax` | brand_id_ax | td_client_id | td_global_id | | --- | --- | --- | | jiioKQC2uMFH | aaa_001 | 3rd_005 | | 0_LjXD4ScR1H | aaa_002 | 3rd_010 | | A0T5jlq77XJH | aaa_003 | 3rd_017 | ### `master_table_brand_id_yz` | brand_id_yz | td_client_id | td_global_id | | --- | --- | --- | | NCnsgr0MusJH | yyy_001 | 3rd_001 | | UA9xZODmd_pH | yyy_002 | 3rd_008 | | lb8seLF7YZBH | yyy_003 | 3rd_012 | | jaUfFmYfVjJH | yyy_004 | 3rd_018 | | nBTjvWWxqydH | zzz_001 | 3rd_002 | ### `enriched_ Table` Even if multiple `master_table`s are created, only one `enriched_table` is generated per source table. The key difference from Case1 is that all `canonical_id`s are enriched into the source tables, allowing linkage with any `master_table`. However, for `brand_id_ax` and `brand_id_yz` specified under `source_tables`, enrichment is limited to the designated tables. The enriched tables are shown below: #### `enriched_site_aaa` | site | td_client_id | td_global_id | brand_id_ax | unified_brand_id | | --- | --- | --- | --- | --- | | aaa.jp | aaa_001 | 3rd_001 | jiioKQC2uMFH | jiioKQC2uMFH | | aaa.jp | aaa_001 | 3rd_002 | jiioKQC2uMFH | jiioKQC2uMFH | | aaa.jp | aaa_001 | 3rd_003 | jiioKQC2uMFH | jiioKQC2uMFH | | aaa.jp | aaa_001 | 3rd_004 | jiioKQC2uMFH | jiioKQC2uMFH | Here, the same value appears for both `brand_id_ax` and `unified_brand_id`, as both were based on `aaa_001`, which happened to be the final leader, generating the respective `canonical_id`. However, the two `canonical_id`s are not inherently related. #### `enriched_site_xxx` | site | td_client_id | td_global_id | brand_id_ax | unified_brand_id | time | | --- | --- | --- | --- | --- | --- | | xxx.jp | | | | | 1672876800 | | xxx.jp | | | | | 1673740800 | | xxx.jp | | | | | 1674604800 | | xxx.jp | xxx_001 | 3rd_004 | jiioKQC2uMFH | jiioKQC2uMFH | 1675555200 | | xxx.jp | xxx_001 | 3rd_005 | jiioKQC2uMFH | jiioKQC2uMFH | 1676419200 | #### `enriched_site_yyy` | site | td_client_id | td_global_id | brand_id_yz | unified_brand_id | time | | --- | --- | --- | --- | --- | --- | | yyy.jp | yyy_001 | 3rd_001 | NCnsgr0MusJH | jiioKQC2uMFH | 1672876800 | | yyy.jp | | | | | 1673740800 | | yyy.jp | | | | | 1674604800 | | yyy.jp | | | | | 1675555200 | | yyy.jp | yyy_002 | 3rd_005 | UA9xZODmd_pH | jiioKQC2uMFH | 1676419200 | #### `enriched_site_zzz` | site | td_client_id | td_global_id | brand_id_yz | unified_brand_id | time | | --- | --- | --- | --- | --- | --- | | zzz.jp | | | | | 1672876800 | | zzz.jp | zzz_001 | 3rd_002 | nBTjvWWxqydH | jiioKQC2uMFH | 1673740800 | | zzz.jp | | | | | 1674604800 | | zzz.jp | | | | | 1675555200 | | zzz.jp | | | | | 1676419200 | ## Case3. Generating a Unified `canonical_id` by Merging Multiple `canonical_id`s By using `merge_by_canonical_ids:` in the configuration for each `canonical_id` within `canonical_ids:`, you can generate a unified `canonical_id` by merging other `canonical_id`s. ```yaml canonical_ids: - name: brand_id_ax merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_aaa', 'site_xxx'] merge_iterations: 5 - name: brand_id_yz merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_yyy', 'site_zzz'] merge_iterations: 5 - name: unified_brand_id merge_by_keys: [] merge_by_canonical_ids: [brand_id_ax, brand_id_yz] merge_iterations: 5 ``` In this configuration, `unified_brand_id` is generated by merging `brand_id_ax` and `brand_id_yz`. Since `unified_brand_id` is set as the `canonical_id` for the `master_table`, it must cover all source tables. Therefore, it is crucial to ensure that the combined `source_tables:` of `brand_id_ax` and `brand_id_yz` cover all necessary tables. Although the `merge_by_keys:` for `unified_brand_id` is empty, the configuration works because `merge_by_canonical_ids` is specified. Under this setup, the Unification Algorithm will first process `brand_id_ax` and `brand_id_yz` in parallel. Once these are complete, the Unification Algorithm for `unified_brand_id` will execute. ### Unification Algorithm with `merge_by_canonical_ids` When `merge_by_canonical_ids:` is specified, the initial graph of the Unification Algorithm takes on a completely different structure. In addition to the keys set in `merge_by_keys:`, the keys from the `merge_by_keys:` of the canonical IDs specified in `merge_by_canonical_ids:`, as well as the canonical IDs themselves, are added as keys. The configuration for `unified_brand_id` will be used to explain this process. ```yaml - name: unified_brand_id merge_by_keys: [] merge_by_canonical_ids: [brand_id_ax, brand_id_yz] merge_iterations: 5 ``` #### `unified_brand_id_graph_unify_loop_0` Under this configuration, `graph_unify_loop_0` will look as follows: Despite having an empty `merge_by_keys:`, all keys from all data sources and the canonical IDs themselves are included. #### Creation Rules for `graph_unify_loop_0` The initial graph starts by merging the following graphs, meaning that the process begins with a highly interconnected state: 1. The final graph of `brand_id_ax` 2. The final graph of `brand_id_yz` 3. The initial graph generated from the combined source tables of `brand_id_ax` and `brand_id_yz` #### Rules for Determining `merge_by_keys:` The applicable `merge_by_keys:` configuration is determined by the following rules: - The canonical IDs specified in `merge_by_canonical_ids:` are added as stitching keys. - Then, `merge_by_keys:` is determined in the following order of precedence: 1. Keys in the `merge_by_keys:` of `unified_brand_id` have the highest priority. 2. Canonical IDs from `brand_id_ax` and `brand_id_yz`, as specified in `merge_by_canonical_ids:`, have the next highest priority in the order listed. 3. The `merge_by_keys:` settings of `brand_id_ax` and `brand_id_yz` (excluding keys already used by `unified_brand_id`) have the lowest priority. In this example, the final `merge_by_keys:` is as follows: ```yaml merge_by_keys: [canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id, td_global_id] ``` The Unification Algorithm is applied to `graph_unify_loop_0` using this priority order. #### `unified_brand_id_graph` ![canonical_id_1](/assets/unified_brand_id_graph.4cd799a8089317c224d66028f75eca9bb3880e4083474c5773b4da49f6c7ff1e.1bb9a569.png) | key | canonical_id | | --- | --- | | 04zahkB6hA9G | wONJPFlQsJyj | In this configuration, the `canonical_id` is generated based on `canonical_id (brand_id_ax)`. However, generating a `canonical_id` based on another `canonical_id` means that any changes in the base keys of the original `canonical_id` will also affect this one, adding complexity. To avoid this, consider specifying keys in the `merge_by_keys:` of `unified_brand_id`: ```yaml - name: unified_brand_id merge_by_keys: [td_global_id] merge_by_canonical_ids: [brand_id_ax, brand_id_yz] merge_iterations: 5 ``` In this case, the true priority of `merge_by_keys:` becomes: ```yaml merge_by_keys: [td_global_id, canonical_id (brand_id_ax), canonical_id (brand_id_yz), td_client_id] ``` The initial graph remains the same, but the final graph will use `td_global_id` as the leader. ## Case4. Setting `canonical_id` as an Attribute in the `master_table` A `canonical_id` can be set as an attribute in the Parent Segment. ```yaml canonical_ids: - name: brand_id_ax merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_aaa','site_xxx'] merge_iterations: 5 - name: brand_id_yz merge_by_keys: [td_client_id, td_global_id] source_tables: ['site_yyy','site_zzz'] merge_iterations: 5 - name: unified_brand_id merge_by_keys: [] merge_by_canonical_ids: [brand_id_ax, brand_id_yz] merge_iterations: 5 master_tables: - name: master_table_ex3 canonical_id: unified_brand_id attributes: - name: brand_id_ax source_canonical_id: brand_id_ax - name: brand_id_yz source_canonical_id: brand_id_yz - name: td_client_id invalid_texts: [''] source_columns: - {table: 'site_aaa', order: first, order_by: td_client_id, priority: 1} - {table: 'site_xxx', order: first, order_by: td_client_id, priority: 1} - {table: 'site_yyy', order: first, order_by: td_client_id, priority: 1} - {table: 'site_zzz', order: first, order_by: td_client_id, priority: 1} - name: td_global_id valid_regexp: "3rd_*" invalid_texts: [''] source_columns: - {table: 'site_aaa', order: last, order_by: time, priority: 1} - {table: 'site_xxx', order: last, order_by: time, priority: 1} - {table: 'site_yyy', order: last, order_by: time, priority: 1} - {table: 'site_zzz', order: last, order_by: time, priority: 1} ``` In this configuration, the canonical IDs `brand_id_ax` and `brand_id_yz` are set within the `attributes:` section. Other canonical IDs can also be configured as attributes, but they must always be related to the `canonical_id` of the `master_table` (in this case, `unified_brand_id`) through `merge_by_canonical_ids:`. It is also worth reiterating that `brand_id_ax` and `brand_id_yz` cannot be set as the canonical ID of the `master_table`. This is because they do not cover all source tables, which prevents the enriched canonical ID from being applied to all tables. ### `master_table` The output `master_table` generates records based only on the number of `unified_brand_id` specified in `canonical_id:`. It is important to understand that attributes do not influence this record generation. As seen in Case1, `brand_id_ax` and `brand_id_yz` each contained multiple canonical IDs. However, when referenced as attributes, only the canonical ID corresponding to the most recent key (in terms of the `time` column) is included. | unified_brand_id | td_client_id | td_global_id | brand_id_ax | brand_id_yz | | --- | --- | --- | --- | --- | | wONJPFlQsJyj | aaa_001 | 3rd_018 | XuCLIRp30bxG | RXcPcKu6CO1G |