# What is do_no_merge_key? The `do_not_merge_key:` option applies stitching rules that differ from the traditional Unification stitching rules: 1. If the values of the key specified in `do_not_merge_key:` are different, records must **not** be stitched together (even if other keys link them). 2. Records without this key's value can be stitched together using other keys. ## When to Use `do_not_merge_key` 1. You have a unique key (e.g., `member_id`, `customer_id`) to ensure personal uniqueness. 2. For records lacking this key, you want to stitch based on secondary keys (e.g., `email`, `tel`) wherever possible. By setting `do_not_merge_key: member_id`, records with distinct `member_id`s remain separate. For records without `member_id`, stitching proceeds using the available keys. Let's deepen our understanding with specific examples. ## Dataset This page use a single, simple table for this example. ### site_aaa | member_id | email | tel | name | | --- | --- | --- | --- | | 1 | a@ex.com | 1111 | Taka | | 2 | a@ex.com | 2222 | Tatsuo | | 3 | b@ex.com | 3333 | Naruse | | 3 | b@ex.com | 4444 | Yuichiro | | NULL | c@ex.com | 5555 | Minero | | NULL | c@ex.com | 6666 | Kaz | ### id_unification_ex4.dig First, let's introduce the Workflow (WF) to be executed in this section. ```yaml +call_unification: http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call headers: - authorization: ${secret:td.apikey} method: POST retry: true content_format: json content: run_canonical_ids: true run_enrichments: true run_master_tables: true full_refresh: true keep_debug_tables: true unification: !include : unification_ex4.yml ``` ### unification_ex4.yml ```yaml name: test_id_unification_ex4 keys: - name: member_id - name: email tables: - database: test_id_unification_ex4 table: site_aaa key_columns: - {column: member_id, key: member_id} - {column: email, key: email} canonical_ids: - name: person_id merge_by_keys: [member_id, email] merge_iterations: 3 do_not_merge_key: member_id master_tables: - name: master_table_ex4 canonical_id: person_id attributes: - name: member_id source_columns: - {table: site_aaa, priority: 1} - name: email source_columns: - {table: site_aaa, priority: 1} - name: tel array_elements: 2 source_columns: - {table: site_aaa, priority: 1} - name: name array_elements: 2 source_columns: - {table: site_aaa, priority: 1} ``` ## Cases Where Default ID Unification Falls Short For example, consider scenarios where the email used as a key includes shared addresses like mailing lists or family emails. In such cases, linking data based solely on email might mistakenly associate different individuals. - However, removing email as a key would hinder proper stitching. - On the other hand, there may also be keys like `member_id` that ensure personal uniqueness. Here’s a potential stitching approach you might consider: 1. Prioritize stitching based on the values of `member_id` wherever possible. 2. If no `member_id` value exists, attempt stitching using email as a secondary key. Let’s revisit the table below. | member_id | email | tel | name | | --- | --- | --- | --- | | 1 | a@ex.com | 1111 | Taka | | 2 | a@ex.com | 2222 | Tatsuo | | 3 | b@ex.com | 3333 | Naruse | | 3 | b@ex.com | 4444 | Yuichiro | | NULL | c@ex.com | 5555 | Minero | | NULL | c@ex.com | 6666 | Kaz | Given this data: - Rows 1 and 2 share the same email but different `member_id`s, so they shouldn’t be merged. - Rows 3 and 4 share the same `member_id` and should be merged. - Rows 5 and 6 lack `member_id`, so they should be merged based on email. This ideal stitching behavior could be achieved using a specific unification option: `do_not_merge_key`. ## Without Using `do_not_merge_key` Let’s consider a conventional approach without `do_not_merge_key` by configuring `canonical_ids` as shown below: ```yaml canonical_ids: - name: person_id merge_by_keys: [member_id, email] merge_iterations: 3 ``` The output (`enriched_site_aaa` table) wouldn’t meet expectations. Rows 1 and 2, despite having different `member_id`s, are treated as the same individual because they share the same email. This happens even though their shared email is merely a mailing list address. | member_id | email | tel | name | canonical_id | | --- | --- | --- | --- | --- | | 1 | a@ex.com | 1111 | Taka | 4ydklKlyPnfa | | 2 | a@ex.com | 2222 | Tatsuo | 4ydklKlyPnfa | | 3 | b@ex.com | 3333 | Naruse | xqaWYjT4GR3a | | 3 | b@ex.com | 4444 | Yuichiro | xqaWYjT4GR3a | | NULL | c@ex.com | 5555 | Minero | NEKDReELMAx | | NULL | c@ex.com | 6666 | Kaz | NEKDReELMAx | How can we ensure rows 1 and 2 are treated as different individuals? ## Using `do_not_merge_key` Next, let’s configure `canonical_ids` with the `do_not_merge_key` option and execute the unification: ```yaml canonical_ids: - name: person_id merge_by_keys: [member_id, email] merge_iterations: 3 do_not_merge_key: member_id ``` This results in the desired output where rows 1 and 2 are treated separately, while rows 5 and 6 are merged based on email. | member_id | email | tel | name | canonical_id | | --- | --- | --- | --- | --- | | 1 | a@ex.com | 1111 | Taka | 4ydklKlyPnfa | | 2 | a@ex.com | 2222 | Tatsuo | XNKI3XAY1Hja | | 3 | b@ex.com | 3333 | Naruse | xqaWYjT4GR3a | | 3 | b@ex.com | 4444 | Yuichiro | xqaWYjT4GR3a | | NULL | c@ex.com | 5555 | Minero | NEKDReELMAx | | NULL | c@ex.com | 6666 | Kaz | NEKDReELMAx | The `master_table` output is as follows: | person_id | member_id | email | tel | name | | --- | --- | --- | --- | --- | | 58MMqWNFdAhu | 1 | a@ex.com | [1111] | ["Taka"] | | WDbg4Lovngdu | 2 | a@ex.com | [2222] | ["Tatsuo"] | | wkL-X_7PU2Ju | 3 | b@ex.com | [4444, 3333] | ["Yuichiro", "Naruse"] | | UVs_qHcWwZnz | NULL | c@ex.com | [6666, 5555] | ["Kaz", "Minero"] | ## Explanation of Unification Algorithm The Unification Algorithm for this case is explained. ### graph_unify_loop_0 (1,2,3,graph also) ![canonical_id_1](/assets/graph_unify_loop_0.4f415820510854ccfeaadedb05afaa03fc5d18422dc7ee81f2538f08ee7f9433.f8b38615.png) In fact, the initial graph is the final state this time. In the conventional Algorithm, "leader_id: 2" should be replaced (merged) by "leader_id: 1" , but `do_not_merge_key:` is set to member_id, so that it is not merged. So, no further changes are made. ### Generation and assignment of `canonical_id` In this case, four different `canonical_ids` will be generated. ![canonical_id_1](/assets/7-2-1.b3ba0da06f541a3416359a88c322e0c18fac26db0177ad5ed4cfb064081077d0.f8b38615.png) The difference is that "follower: a@ex.com" points to two leaders, "member_id: 1" and "member_id: 2," regardless of the state of convergence. In fact, there are 2 rows of records for a@ex.com in the lookup table. | id | `canonical_id` | | --- | --- | | 1 | 4ydklKlyPnfa | | 2 | XNKI3XAY1Hja | | 3 | xqaWYjT4GR3a | | a@ex.com | 4ydklKlyPnfa | | a@ex.com | XNKI3XAY1Hja | | b@ex.com | xqaWYjT4GR3a | | c@ex.com | NEKDReELMAxH | This means that the algorithm has converged without "all the followers point to only one leader," which is one of the [methods to determine algorithm convergence](/en/tools/idu/p9_knowledge). However, the other method, "checking the output of the `+report_diff` task in WF Logs" is still valid. ## Cases where do_not_merge_key should be used Finally, let's review the cases in which do_not_merge_key should be used. 1. 1. (Assumption) You have one key whose uniqueness is guaranteed (e.g. member_id, customer_id, etc.). 2. for records that do not have the value of key 1., we want to proceed with stitching as much as possible, using a key whose uniqueness is not partially guaranteed (e.g., email, tel, etc.). In this case, if you set `do_not_merge_key: member_id`, the records will be judged as different persons even if other keys are stitched with the same value, as long as member_id is different. However, for records that do not have member_id, other keys are stitched together to the maximum extent possible.