Skip to content

What is do_no_merge_key?

The do_not_merge_key: option applies stitching rules that differ from the traditional Unification stitching rules:

  1. If the values of the key specified in do_not_merge_key: are different, records must not be stitched together (even if other keys link them).
  2. Records without this key's value can be stitched together using other keys.

When to Use do_not_merge_key

  1. You have a unique key (e.g., member_id, customer_id) to ensure personal uniqueness.
  2. For records lacking this key, you want to stitch based on secondary keys (e.g., email, tel) wherever possible.

By setting do_not_merge_key: member_id, records with distinct member_ids remain separate. For records without member_id, stitching proceeds using the available keys.

Let's deepen our understanding with specific examples.

Dataset

This page use a single, simple table for this example.

site_aaa

member_idemailtelname
1a@ex.com1111Taka
2a@ex.com2222Tatsuo
3b@ex.com3333Naruse
3b@ex.com4444Yuichiro
NULLc@ex.com5555Minero
NULLc@ex.com6666Kaz

id_unification_ex4.dig

First, let's introduce the Workflow (WF) to be executed in this section.

+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call  
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:

    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex4.yml

unification_ex4.yml

name: test_id_unification_ex4

keys:
  - name: member_id
  - name: email

tables:
  - database: test_id_unification_ex4
    table: site_aaa
    key_columns:
      - {column: member_id, key: member_id}
      - {column: email, key: email}

canonical_ids:
  - name: person_id
    merge_by_keys: [member_id, email]
    merge_iterations: 3
    do_not_merge_key: member_id

master_tables:
  - name: master_table_ex4
    canonical_id: person_id
    attributes:
      - name: member_id
        source_columns:
          - {table: site_aaa, priority: 1}
      - name: email
        source_columns:
          - {table: site_aaa, priority: 1}
      - name: tel
        array_elements: 2
        source_columns:
          - {table: site_aaa, priority: 1}
      - name: name
        array_elements: 2
        source_columns:
          - {table: site_aaa, priority: 1}

Cases Where Default ID Unification Falls Short

For example, consider scenarios where the email used as a key includes shared addresses like mailing lists or family emails. In such cases, linking data based solely on email might mistakenly associate different individuals.

  • However, removing email as a key would hinder proper stitching.
  • On the other hand, there may also be keys like member_id that ensure personal uniqueness.

Here’s a potential stitching approach you might consider:

  1. Prioritize stitching based on the values of member_id wherever possible.
  2. If no member_id value exists, attempt stitching using email as a secondary key.

Let’s revisit the table below.

member_idemailtelname
1a@ex.com1111Taka
2a@ex.com2222Tatsuo
3b@ex.com3333Naruse
3b@ex.com4444Yuichiro
NULLc@ex.com5555Minero
NULLc@ex.com6666Kaz

Given this data:

  • Rows 1 and 2 share the same email but different member_ids, so they shouldn’t be merged.
  • Rows 3 and 4 share the same member_id and should be merged.
  • Rows 5 and 6 lack member_id, so they should be merged based on email.

This ideal stitching behavior could be achieved using a specific unification option: do_not_merge_key.

Without Using do_not_merge_key

Let’s consider a conventional approach without do_not_merge_key by configuring canonical_ids as shown below:

canonical_ids:
  - name: person_id
    merge_by_keys: [member_id, email]
    merge_iterations: 3

The output (enriched_site_aaa table) wouldn’t meet expectations. Rows 1 and 2, despite having different member_ids, are treated as the same individual because they share the same email. This happens even though their shared email is merely a mailing list address.

member_idemailtelnamecanonical_id
1a@ex.com1111Taka4ydklKlyPnfa
2a@ex.com2222Tatsuo4ydklKlyPnfa
3b@ex.com3333NarusexqaWYjT4GR3a
3b@ex.com4444YuichiroxqaWYjT4GR3a
NULLc@ex.com5555MineroNEKDReELMAx
NULLc@ex.com6666KazNEKDReELMAx

How can we ensure rows 1 and 2 are treated as different individuals?

Using do_not_merge_key

Next, let’s configure canonical_ids with the do_not_merge_key option and execute the unification:

canonical_ids:
  - name: person_id
    merge_by_keys: [member_id, email]
    merge_iterations: 3
    do_not_merge_key: member_id

This results in the desired output where rows 1 and 2 are treated separately, while rows 5 and 6 are merged based on email.

member_idemailtelnamecanonical_id
1a@ex.com1111Taka4ydklKlyPnfa
2a@ex.com2222TatsuoXNKI3XAY1Hja
3b@ex.com3333NarusexqaWYjT4GR3a
3b@ex.com4444YuichiroxqaWYjT4GR3a
NULLc@ex.com5555MineroNEKDReELMAx
NULLc@ex.com6666KazNEKDReELMAx

The master_table output is as follows:

person_idmember_idemailtelname
58MMqWNFdAhu1a@ex.com[1111]["Taka"]
WDbg4Lovngdu2a@ex.com[2222]["Tatsuo"]
wkL-X_7PU2Ju3b@ex.com[4444, 3333]["Yuichiro", "Naruse"]
UVs_qHcWwZnzNULLc@ex.com[6666, 5555]["Kaz", "Minero"]

Explanation of Unification Algorithm

The Unification Algorithm for this case is explained.

graph_unify_loop_0 (1,2,3,graph also)

canonical_id_1

In fact, the initial graph is the final state this time. In the conventional Algorithm, "leader_id: 2" should be replaced (merged) by "leader_id: 1" , but do_not_merge_key: is set to member_id, so that it is not merged. So, no further changes are made.

Generation and assignment of canonical_id

In this case, four different canonical_ids will be generated.

canonical_id_1

The difference is that "follower: a@ex.com" points to two leaders, "member_id: 1" and "member_id: 2," regardless of the state of convergence.

In fact, there are 2 rows of records for a@ex.com in the lookup table.

idcanonical_id
14ydklKlyPnfa
2XNKI3XAY1Hja
3xqaWYjT4GR3a
a@ex.com4ydklKlyPnfa
a@ex.comXNKI3XAY1Hja
b@ex.comxqaWYjT4GR3a
c@ex.comNEKDReELMAxH

This means that the algorithm has converged without "all the followers point to only one leader," which is one of the methods to determine algorithm convergence. However, the other method, "checking the output of the +report_diff task in WF Logs" is still valid.

Cases where do_not_merge_key should be used

Finally, let's review the cases in which do_not_merge_key should be used. 1.

  1. (Assumption) You have one key whose uniqueness is guaranteed (e.g. member_id, customer_id, etc.).
  2. for records that do not have the value of key 1., we want to proceed with stitching as much as possible, using a key whose uniqueness is not partially guaranteed (e.g., email, tel, etc.).

In this case, if you set do_not_merge_key: member_id, the records will be judged as different persons even if other keys are stitched with the same value, as long as member_id is different. However, for records that do not have member_id, other keys are stitched together to the maximum extent possible.