This page is based on the previous page Learning canonical_id
for Cookie ID, so some duplicate explanations have been omitted; please make sure you have read through the previous page first before continuing.
This page deals with a case where the same person is identified by stitching together the td_client_id
, td_global_id
, td_ssc_id
and email of multiple sites. There are four sites, but each table has a different combination of keys that it has.
site_aaa | site_aaa | site_aaa | site_xxx | site_xxx | site_yyy | site_yyy | site_zzz | site_zzz | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
日付 | td_client_id | td_global_id | td_ssc_id | td_ssc_id | td_global_id | td_ssc_id | td_client_id | ||||||
1月 | 5日 | aaa_001 | 3rd_001 | ssc_001 | a@ex.com | ssc_001 | |||||||
15日 | aaa_001 | 3rd_002 | ssc_001 | zzz_001 | a@ex.com | ||||||||
25日 | aaa_001 | 3rd_003 | ssc_001 | ||||||||||
2月 | 5日 | aaa_001 | 3rd_004 | ssc_001 | ssc_001 | 3rd_004 | |||||||
15日 | ssc_001 | 3rd_005 | a@ex.com | ssc_001 | |||||||||
25日 | a@ex.com | ssc_001 | zzz_003 | a@ex.com | |||||||||
3月 | 5日 | zzz_003 | a@ex.com | ||||||||||
15日 | ssc_001 | 3rd_008 | zzz_003 | a@ex.com | |||||||||
25日 | aaa_002 | 3rd_009 | ssc_002 | ssc_001 | 3rd_009 | ||||||||
4月 | 5日 | aaa_002 | 3rd_010 | ssc_002 | b@ex.com | ssc_003 | |||||||
15日 | b@ex.com | ssc_003 | zzz_004 | b@ex.com | |||||||||
25日 | ssc_002 | 3rd_010 | zzz_004 | c@ex.com | |||||||||
5月 | 5日 | aaa_003 | 3rd_013 | ssc_003 | ssc_002 | 3rd_013 | |||||||
15日 | aaa_003 | 3rd_014 | ssc_003 | ||||||||||
25日 | aaa_003 | 3rd_015 | ssc_004 | c@ex.com | ssc_003 | zzz_005 | c@ex.com | ||||||
6月 | 5日 | aaa_003 | 3rd_016 | ssc_004 | ssc_003 | 3rd_016 | |||||||
15日 | ssc_003 | 3rd_017 | zzz_005 | c@ex.com | |||||||||
25日 | c@ex.com | ssc_004 | zzz_005 | c@ex.com |
The above data is assumed to be stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using Workflow Samples.
time | site | td_client_id | td_global_id | td_ssc_id |
---|---|---|---|---|
2023/01/05 | aaa.jp | aaa_001 | 3rd_001 | ssc_001 |
2023/01/15 | aaa.jp | aaa_001 | 3rd_002 | ssc_001 |
2023/01/25 | aaa.jp | aaa_001 | 3rd_003 | ssc_001 |
2023/02/05 | aaa.jp | aaa_001 | 3rd_004 | ssc_001 |
2023/02/15 | aaa.jp | |||
2023/02/25 | aaa.jp | |||
2023/03/05 | aaa.jp | |||
2023/03/15 | aaa.jp | |||
2023/03/25 | aaa.jp | aaa_002 | 3rd_009 | ssc_002 |
2023/04/05 | aaa.jp | aaa_002 | 3rd_010 | ssc_002 |
2023/04/15 | aaa.jp | |||
2023/04/25 | aaa.jp | |||
2023/05/05 | aaa.jp | aaa_003 | 3rd_013 | ssc_003 |
2023/05/15 | aaa.jp | aaa_003 | 3rd_014 | ssc_003 |
2023/05/25 | aaa.jp | aaa_003 | 3rd_015 | ssc_004 |
2023/06/05 | aaa.jp | aaa_003 | 3rd_016 | ssc_004 |
2023/06/15 | aaa.jp | |||
2023/06/25 | aaa.jp |
time | site | td_ssc_id | td_global_id |
---|---|---|---|
2023/01/05 | xxx.jp | NULL | NULL |
2023/01/15 | xxx.jp | NULL | NULL |
2023/01/25 | xxx.jp | NULL | NULL |
2023/02/05 | xxx.jp | ssc_001 | 3rd_004 |
2023/02/15 | xxx.jp | ssc_001 | 3rd_005 |
2023/02/25 | xxx.jp | NULL | NULL |
2023/03/05 | xxx.jp | NULL | NULL |
2023/03/15 | xxx.jp | ssc_001 | 3rd_008 |
2023/03/25 | xxx.jp | ssc_001 | 3rd_009 |
2023/04/05 | xxx.jp | NULL | NULL |
2023/04/15 | xxx.jp | NULL | NULL |
2023/04/25 | xxx.jp | ssc_002 | 3rd_010 |
2023/05/05 | xxx.jp | ssc_002 | 3rd_013 |
2023/05/15 | xxx.jp | NULL | NULL |
2023/05/25 | xxx.jp | NULL | NULL |
2023/06/05 | xxx.jp | ssc_003 | 3rd_016 |
2023/06/15 | xxx.jp | ssc_003 | 3rd_017 |
2023/06/25 | xxx.jp | NULL | NULL |
time | site | td_ssc_id | |
---|---|---|---|
2023/01/05 | yyy.jp | a@ex.com | ssc_001 |
2023/01/15 | yyy.jp | NULL | NULL |
2023/01/25 | yyy.jp | NULL | NULL |
2023/02/05 | yyy.jp | NULL | NULL |
2023/02/15 | yyy.jp | a@ex.com | ssc_001 |
2023/02/25 | yyy.jp | a@ex.com | ssc_001 |
2023/03/05 | yyy.jp | NULL | NULL |
2023/03/15 | yyy.jp | NULL | NULL |
2023/03/25 | yyy.jp | NULL | NULL |
2023/04/05 | yyy.jp | b@ex.com | ssc_003 |
2023/04/15 | yyy.jp | b@ex.com | ssc_003 |
2023/04/25 | yyy.jp | NULL | NULL |
2023/05/05 | yyy.jp | NULL | NULL |
2023/05/15 | yyy.jp | NULL | NULL |
2023/05/25 | yyy.jp | c@ex.com | ssc_003 |
2023/06/05 | yyy.jp | NULL | NULL |
2023/06/15 | yyy.jp | NULL | NULL |
2023/06/25 | yyy.jp | c@ex.com | ssc_004 |
time | site | td_client_id | |
---|---|---|---|
2023/01/05 | zzz.jp | NULL | NULL |
2023/01/15 | zzz.jp | zzz_001 | a@ex.com |
2023/01/25 | zzz.jp | NULL | NULL |
2023/02/05 | zzz.jp | NULL | NULL |
2023/02/15 | zzz.jp | NULL | NULL |
2023/02/25 | zzz.jp | zzz_003 | a@ex.com |
2023/03/05 | zzz.jp | zzz_003 | a@ex.com |
2023/03/15 | zzz.jp | zzz_003 | a@ex.com |
2023/03/25 | zzz.jp | NULL | NULL |
2023/04/05 | zzz.jp | NULL | NULL |
2023/04/15 | zzz.jp | zzz_004 | b@ex.com |
2023/04/25 | zzz.jp | zzz_004 | c@ex.com |
2023/05/05 | zzz.jp | NULL | NULL |
2023/05/15 | zzz.jp | NULL | NULL |
2023/05/25 | zzz.jp | zzz_005 | c@ex.com |
2023/06/05 | zzz.jp | NULL | NULL |
2023/06/15 | zzz.jp | zzz_005 | c@ex.com |
2023/06/25 | zzz.jp | zzz_005 | c@ex.com |
+call_unification:
http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
headers:
- authorization: ${secret:td.apikey}
method: POST
retry: true
content_format: json
content:
run_canonical_ids: true
run_enrichments: true
run_master_tables: true
full_refresh: true
keep_debug_tables: true
unification:
!include : unification_ex2.yml
name: test_id_unification_ex2
keys:
- name: td_client_id
- name: td_global_id
- name: td_ssc_id
- name: email
tables:
- database: test_id_unification_ex2
table: site_aaa
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: td_global_id, key: td_global_id}
- {column: td_ssc_id, key: td_ssc_id}
- database: test_id_unification_ex2
table: site_xxx
key_columns:
- {column: td_ssc_id, key: td_ssc_id}
- {column: td_global_id, key: td_global_id}
- database: test_id_unification_ex2
table: site_yyy
key_columns:
- {column: email, key: email}
- {column: td_ssc_id, key: td_ssc_id}
- database: test_id_unification_ex2
table: site_zzz
key_columns:
- {column: td_client_id, key: td_client_id}
- {column: email, key: email}
canonical_ids:
- name: person_id
merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]
merge_iterations: 5
master_tables:
- name: master_table_ex2
canonical_id: person_id
attributes:
- name: email
source_columns:
- {table: site_yyy, order: last, order_by: time, priority: 1}
- {table: site_zzz, order: last, order_by: time, priority: 1}
- name: td_ssc_id
array_elements: 5
source_columns:
- {table: site_xxx, order: last, order_by: time, priority: 1}
- {table: site_yyy, order: last, order_by: time, priority: 2}
- name: td_client_id
array_elements: 5
source_columns:
- {table: site_aaa, order: last, order_by: time, priority: 1}
- {table: site_zzz, order: last, order_by: time, priority: 4}
- name: td_global_id
array_elements: 5
source_columns:
- {table: site_aaa, order: last, order_by: time, priority: 1}
- {table: site_xxx, order: last, order_by: time, priority: 2}
The initial state of the graph is as follows:
The graph_unify_loop_0
table is created within the +extract_and_merge
task of the workflow. Let's examine how the graph is constructed.
canonical_ids:
- name: person_id
merge_by_keys: [email, td_ssc_id, td_client_id, td_global_id]
merge_iterations: 5
In this configuration of canonical_ids:
, merge_by_keys:
prioritizes email
to ensure the creation of the most universal canonical_id
, followed by td_ssc_id
.
Using the original data, the graph_unify_loop_0
table is generated through the same steps as in the previous section.
Unlike before, pairs for site_aaa
containing three or more keys in a single table are created. In this case, for a single record, the leader and followers are as follows:
- Leader:
td_ssc_id
- Followers:
td_client_id
td_global_id
td_ssc_id
As a result, the record expands into three pairs for each follower (highlighted in green in the figure above).
follower_ns
and leader_ns
identify which key corresponds to each id. (In this example, 1 is td_client_id
, 2 is td_global_id
, 3 is td_ssc_id
, and 4 is email
.) In this configuration, the priority order of ns
is [4,3,1,2].
Based on the graph from loop_0, leaders are updated according to the following rules:
The key difference this time is:
- If a leader is connected to a higher-priority leader, it will be replaced before leaders of the same priority adjacent to the follower.
All leaders are replaced with new_leader
(unless they are the minimum or maximum priority, in which case they remain unchanged). The table resulting from merging all these replacements is the graph_unify_loop_1
table.
Subsequent loops follow the same process with the graph from the previous iteration.
The process converges in the third loop.
In this example, the order of keys specified in merge_by_keys:
prioritizes email
as the highest. This is because, unlike other keys, the value of email
is less likely to change over time. For users without an email, the td_ssc_id
becomes the final leader, and the canonical_id
is generated based on this value.
The canonical_id
used in the master_table
is generally created with all keys specified in merge_by_keys:
. However, canonical_id
can also be created using a subset of the keys.
That said, the canonical_id
configured in the master_table
must meet the following condition:
- The configuration must ensure that a
canonical_id
is assigned to every record in all source tables. In other words, every source table must have at least one key included inmerge_by_keys:
.
For this example (Example 2 data), the keys in each table are as follows:
site_aaa
td_client_id
td_global_id
td_ssc_id
site_xxx
td_ssc_id
td_global_id
site_yyy
email
td_ssc_id
site_zzz
td_client_id
email
Given this, the merge_by_keys:
configurations valid for the master_table
are:
- Good:
[email, td_ssc_id, td_global_id, td_client_id]
- Good:
[td_ssc_id, td_global_id, td_client_id]
- Good:
[email, td_global_id]
- Good:
[td_ssc_id, td_client_id]
On the other hand, the following configurations will result in errors:
- Bad:
[td_global_id, td_client_id]
- Missing keys from the
site_yyy
table.
- Missing keys from the
- Bad:
[email, td_client_id]
- Missing keys from the
site_xxx
table.
- Missing keys from the
In such cases, an error like the following will appear:
400 Bad Request: {"master_tables[0].attributes[0].source_columns[0].table":["is invalid because it doesn't have columns merged into this master_table's canonical_id 'person_id'"],"master_tables[0].attributes[1].sou...
For example, in the [td_global_id, td_client_id]
case, canonical_id
is only assigned to records generated using td_global_id
and td_client_id
. Since the site_yyy
table does not include these keys, it cannot be assigned a canonical_id
. At least one key from merge_by_keys:
(used for stitching) must be present to ensure that every record in all tables is assigned a canonical_id
.