Skip to content

Learning canonical_id for Cookie ID and Email

This page is based on the previous page Learning canonical_id for Cookie ID, so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing.

This page deals with a case where the same person is identified by stitching together the td_client_id, td_global_id, td_ssc_id and email of multiple sites. There are four sites, but each table has a different combination of keys that it has.

site_aaasite_aaasite_aaasite_xxxsite_xxxsite_yyysite_yyysite_zzzsite_zzz
日付td_client_idtd_global_idtd_ssc_idtd_ssc_idtd_global_idemailtd_ssc_idtd_client_idemail
1月5日aaa_0013rd_001ssc_001a@ex.comssc_001
15日aaa_0013rd_002ssc_001zzz_001a@ex.com
25日aaa_0013rd_003ssc_001
2月5日aaa_0013rd_004ssc_001ssc_0013rd_004
15日ssc_0013rd_005a@ex.comssc_001
25日a@ex.comssc_001zzz_003a@ex.com
3月5日zzz_003a@ex.com
15日ssc_0013rd_008zzz_003a@ex.com
25日aaa_0023rd_009ssc_002ssc_0013rd_009
4月5日aaa_0023rd_010ssc_002b@ex.comssc_003
15日b@ex.comssc_003zzz_004b@ex.com
25日ssc_0023rd_010zzz_004c@ex.com
5月5日aaa_0033rd_013ssc_003ssc_0023rd_013
15日aaa_0033rd_014ssc_003
25日aaa_0033rd_015ssc_004c@ex.comssc_003zzz_005c@ex.com
6月5日aaa_0033rd_016ssc_004ssc_0033rd_016
15日ssc_0033rd_017zzz_005c@ex.com
25日c@ex.comssc_004zzz_005c@ex.com

Data Preparation

The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using Workflow Samples.

site_aaa

timesitetd_client_idtd_global_idtd_ssc_id
2023/01/05aaa.jpaaa_0013rd_001ssc_001
2023/01/15aaa.jpaaa_0013rd_002ssc_001
2023/01/25aaa.jpaaa_0013rd_003ssc_001
2023/02/05aaa.jpaaa_0013rd_004ssc_001
2023/02/15aaa.jp
2023/02/25aaa.jp
2023/03/05aaa.jp
2023/03/15aaa.jp
2023/03/25aaa.jpaaa_0023rd_009ssc_002
2023/04/05aaa.jpaaa_0023rd_010ssc_002
2023/04/15aaa.jp
2023/04/25aaa.jp
2023/05/05aaa.jpaaa_0033rd_013ssc_003
2023/05/15aaa.jpaaa_0033rd_014ssc_003
2023/05/25aaa.jpaaa_0033rd_015ssc_004
2023/06/05aaa.jpaaa_0033rd_016ssc_004
2023/06/15aaa.jp
2023/06/25aaa.jp

site_xxx

timesitetd_ssc_idtd_global_id
2023/01/05xxx.jpNULLNULL
2023/01/15xxx.jpNULLNULL
2023/01/25xxx.jpNULLNULL
2023/02/05xxx.jpssc_0013rd_004
2023/02/15xxx.jpssc_0013rd_005
2023/02/25xxx.jpNULLNULL
2023/03/05xxx.jpNULLNULL
2023/03/15xxx.jpssc_0013rd_008
2023/03/25xxx.jpssc_0013rd_009
2023/04/05xxx.jpNULLNULL
2023/04/15xxx.jpNULLNULL
2023/04/25xxx.jpssc_0023rd_010
2023/05/05xxx.jpssc_0023rd_013
2023/05/15xxx.jpNULLNULL
2023/05/25xxx.jpNULLNULL
2023/06/05xxx.jpssc_0033rd_016
2023/06/15xxx.jpssc_0033rd_017
2023/06/25xxx.jpNULLNULL

site_yyy

timesiteemailtd_ssc_id
2023/01/05yyy.jpa@ex.comssc_001
2023/01/15yyy.jpNULLNULL
2023/01/25yyy.jpNULLNULL
2023/02/05yyy.jpNULLNULL
2023/02/15yyy.jpa@ex.comssc_001
2023/02/25yyy.jpa@ex.comssc_001
2023/03/05yyy.jpNULLNULL
2023/03/15yyy.jpNULLNULL
2023/03/25yyy.jpNULLNULL
2023/04/05yyy.jpb@ex.comssc_003
2023/04/15yyy.jpb@ex.comssc_003
2023/04/25yyy.jpNULLNULL
2023/05/05yyy.jpNULLNULL
2023/05/15yyy.jpNULLNULL
2023/05/25yyy.jpc@ex.comssc_003
2023/06/05yyy.jpNULLNULL
2023/06/15yyy.jpNULLNULL
2023/06/25yyy.jpc@ex.comssc_004

site_zzz

timesitetd_client_idemail
2023/01/05zzz.jpNULLNULL
2023/01/15zzz.jpzzz_001a@ex.com
2023/01/25zzz.jpNULLNULL
2023/02/05zzz.jpNULLNULL
2023/02/15zzz.jpNULLNULL
2023/02/25zzz.jpzzz_003a@ex.com
2023/03/05zzz.jpzzz_003a@ex.com
2023/03/15zzz.jpzzz_003a@ex.com
2023/03/25zzz.jpNULLNULL
2023/04/05zzz.jpNULLNULL
2023/04/15zzz.jpzzz_004b@ex.com
2023/04/25zzz.jpzzz_004c@ex.com
2023/05/05zzz.jpNULLNULL
2023/05/15zzz.jpNULLNULL
2023/05/25zzz.jpzzz_005c@ex.com
2023/06/05zzz.jpNULLNULL
2023/06/15zzz.jpzzz_005c@ex.com
2023/06/25zzz.jpzzz_005c@ex.com

id_unification_ex2.dig

+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call  
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:

    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex2.yml

unification_ex2.yml

name: test_id_unification_ex2

keys:
  - name: td_client_id
  - name: td_global_id
  - name: td_ssc_id
  - name: email

tables:
  - database: test_id_unification_ex2
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_xxx
    key_columns:
      - {column: td_ssc_id, key: td_ssc_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex2
    table: site_yyy
    key_columns:
      - {column: email, key: email}
      - {column: td_ssc_id, key: td_ssc_id}

  - database: test_id_unification_ex2
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: email, key: email}

canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5

master_tables:
  - name: master_table_ex2
    canonical_id: person_id
    attributes:
      - name: email
        source_columns:
          - {table: site_yyy, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 1}                    
      - name: td_ssc_id
        array_elements: 5
        source_columns:
          - {table: site_xxx, order: last, order_by: time, priority: 1}
          - {table: site_yyy, order: last, order_by: time, priority: 2}
      - name: td_client_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 4}                    
      - name: td_global_id
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_xxx, order: last, order_by: time, priority: 2}

Explanation of the Unification Algorithm

graph_unify_loop_0

The initial state of the graph is as follows:

canonical_id_1

How the graph_unify_loop_0 Table Is Created

The graph_unify_loop_0 table is created within the +extract_and_merge task of the workflow. Let's examine how the graph is constructed.

canonical_ids:
  - name: person_id
    merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]    
    merge_iterations: 5

In this configuration of canonical_ids:, merge_by_keys: prioritizes email to ensure the creation of the most universal canonical_id, followed by td_ssc_id.

Using the original data, the graph_unify_loop_0 table is generated through the same steps as in the previous section.

canonical_id_1

Unlike before, pairs for site_aaa containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows:

  • Leader:
    • td_ssc_id
  • Followers:
    • td_client_id
    • td_global_id
    • td_ssc_id

As a result, the record expands into three pairs for each follower (highlighted in green in the figure above).

follower_ns and leader_ns identify which key corresponds to each id. (In this example, 1 is td_client_id, 2 is td_global_id, 3 is td_ssc_id, and 4 is email.) In this configuration, the priority order of ns is [4,3,1,2].

graph_unify_loop_1

Based on the graph from loop_0, leaders are updated according to the following rules:

canonical_id_1

The key difference this time is:

  • If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower.

All leaders are replaced with new_leader (unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the graph_unify_loop_1 table.

canonical_id_1

canonical_id_1

Subsequent loops follow the same process with the graph from the previous iteration.

graph_unify_loop_2

canonical_id_1

graph_unify_loop_3

The process converges in the third loop.

canonical_id_1

Generating and Assigning canonical_id

In this example, the order of keys specified in merge_by_keys: prioritizes email as the highest. This is because, unlike other keys, the value of email is less likely to change over time. For users without an email, the td_ssc_id becomes the final leader, and the canonical_id is generated based on this value.

canonical_id_1

canonical_id Configurations for master_table

The canonical_id used in the master_table is generally created with all keys specified in merge_by_keys:. However, canonical_id can also be created using a subset of the keys.

That said, the canonical_id configured in the master_table must meet the following condition:

  • The configuration must ensure that a canonical_id is assigned to every record in all source tables. In other words, every source table must have at least one key included in merge_by_keys:.

For this example (Example 2 data), the keys in each table are as follows:

  • site_aaa
    • td_client_id
    • td_global_id
    • td_ssc_id
  • site_xxx
    • td_ssc_id
    • td_global_id
  • site_yyy
    • email
    • td_ssc_id
  • site_zzz
    • td_client_id
    • email

Given this, the merge_by_keys: configurations valid for the master_table are:

  • Good: [email, td_ssc_id, td_global_id, td_client_id]
  • Good: [td_ssc_id, td_global_id, td_client_id]
  • Good: [email, td_global_id]
  • Good: [td_ssc_id, td_client_id]

On the other hand, the following configurations will result in errors:

  • Bad: [td_global_id, td_client_id]
    • Missing keys from the site_yyy table.
  • Bad: [email, td_client_id]
    • Missing keys from the site_xxx table.

In such cases, an error like the following will appear:

400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou...

For example, in the [td_global_id, td_client_id] case, canonical_id is only assigned to records generated using td_global_id and td_client_id. Since the site_yyy table does not include these keys, it cannot be assigned a canonical_id. At least one key from merge_by_keys: (used for stitching) must be present to ensure that every record in all tables is assigned a canonical_id.