Skip to content

Learning canonical_id for Cookie ID

In this page, we will explain about cases where the same person is identified by stitching together the td_client_id and td_global_id of multiple sites. Consider the following data example where there are four sites and td_client_id and td_global_id are available in all tables.

datesite_aaasite_aaasite_xxxsite_xxxsite_yyysite_yyysite_zzzsite_zzz
monthdaytd_client_idtd_global_idtd_client_idtd_global_idtd_client_idtd_global_idtd_client_idtd_global_id
15aaa_0013rd_001yyy_0013rd_001
15aaa_0013rd_002zzz_0013rd_002
25aaa_0013rd_003
25aaa_0013rd_004xxx_0013rd_004
15xxx_0013rd_005yyy_0023rd_005
25yyy_0023rd_006zzz_0033rd_006
35zzz_0033rd_007
15xxx_0023rd_008zzz_0033rd_008
25aaa_0023rd_009xxx_0023rd_009
45aaa_0023rd_010yyy_0033rd_010
15yyy_0033rd_011zzz_0043rd_011
25xxx_0033rd_012zzz_0043rd_012
55aaa_0033rd_013xxx_0033rd_013
15aaa_0033rd_014
25aaa_0033rd_015yyy_0043rd_015zzz_0053rd_015
65aaa_0033rd_016xxx_0043rd_016
15xxx_0043rd_017zzz_0053rd_017
25yyy_0053rd_018zzz_0053rd_018

It may appear that there are multiple users, but in fact these are cases that can be identified to just one person by connecting them with td_client_id and td_global_id across the site.

canonical_id_1

If ID Unification is performed with this data as input, only one canonical_id is expected to be generated. Let's check this in practice.

Data Preparation

We proceed on the assumption that the above data is stored in TD tables with the names site_aaa, site_xxx, site_yyy and site_zzz for each site, as in the site_aaa table below. This table can be generated by using Workflow Samples.

site_aaa

timesitetd_client_idtd_global_id
2023/01/05aaa.jpaaa_0013rd_001
2023/01/15aaa.jpaaa_0013rd_002
2023/01/25aaa.jpaaa_0013rd_003
2023/02/05aaa.jpaaa_0013rd_004
2023/02/15aaa.jp
2023/02/25aaa.jp
2023/03/05aaa.jp
2023/03/15aaa.jp
2023/03/25aaa.jpaaa_0023rd_009
2023/04/05aaa.jpaaa_0023rd_010
2023/04/15aaa.jp
2023/04/25aaa.jp
2023/05/05aaa.jpaaa_0033rd_013
2023/05/15aaa.jpaaa_0033rd_014
2023/05/25aaa.jpaaa_0033rd_015
2023/06/05aaa.jpaaa_0033rd_016
2023/06/15aaa.jp
2023/06/25aaa.jp

site_xxx

timesitetd_client_idtd_global_id
2023/01/05xxx.jp
2023/01/15xxx.jp
2023/01/25xxx.jp
2023/02/05xxx.jpxxx_0013rd_004
2023/02/15xxx.jpxxx_0013rd_005
2023/02/25xxx.jp
2023/03/05xxx.jp
2023/03/15xxx.jpxxx_0023rd_008
2023/03/25xxx.jpxxx_0023rd_009
2023/04/05xxx.jp
2023/04/15xxx.jp
2023/04/25xxx.jpxxx_0033rd_012
2023/05/05xxx.jpxxx_0033rd_013
2023/05/15xxx.jp
2023/05/25xxx.jp
2023/06/05xxx.jpxxx_0043rd_016
2023/06/15xxx.jpxxx_0043rd_017
2023/06/25xxx.jp

site_yyy

timesitetd_client_idtd_global_id
2023/01/05yyy.jpyyy_0013rd_001
2023/01/15yyy.jp
2023/01/25yyy.jp
2023/02/05yyy.jp
2023/02/15yyy.jpyyy_0023rd_005
2023/02/25yyy.jpyyy_0023rd_006
2023/03/05yyy.jp
2023/03/15yyy.jp
2023/03/25yyy.jp
2023/04/05yyy.jpyyy_0033rd_010
2023/04/15yyy.jpyyy_0033rd_011
2023/04/25yyy.jp
2023/05/05yyy.jp
2023/05/15yyy.jp
2023/05/25yyy.jpyyy_0043rd_015
2023/06/05yyy.jp
2023/06/15yyy.jp
2023/06/25yyy.jpyyy_0053rd_018

site_zzz

timesitetd_client_idtd_global_id
2023/01/05zzz.jp
2023/01/15zzz.jpzzz_0013rd_002
2023/01/25zzz.jp
2023/02/05zzz.jp
2023/02/15zzz.jp
2023/02/25zzz.jpzzz_0033rd_006
2023/03/05zzz.jpzzz_0033rd_007
2023/03/15zzz.jpzzz_0033rd_008
2023/03/25zzz.jp
2023/04/05zzz.jp
2023/04/15zzz.jpzzz_0043rd_011
2023/04/25zzz.jpzzz_0043rd_012
2023/05/05zzz.jp
2023/05/15zzz.jp
2023/05/25zzz.jpzzz_0053rd_015
2023/06/05zzz.jp
2023/06/15zzz.jpzzz_0053rd_017
2023/06/25zzz.jpzzz_0053rd_018

id_unification_ex1.dig

+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:
    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex1.yml

unification_ex1.yml

name: test_id_unification_ex1

keys:
  - name: td_client_id
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "3rd_*"
    invalid_texts: ['']

tables:
  - database: test_id_unification_ex1
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
  - database: test_id_unification_ex1
    table: site_xxx
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex1
    table: site_yyy
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex1
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

canonical_ids:
  - name: unified_cookie_id
    merge_by_keys: [td_client_id, td_global_id]
    merge_iterations: 5

master_tables:
  - name: master_table_ex1
    canonical_id: unified_cookie_id
    attributes:
      - name: td_client_id
        invalid_texts: ['']
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: first, order_by: td_client_id, priority: 1}
          - {table: site_xxx, order: first, order_by: td_client_id, priority: 2}
          - {table: site_yyy, order: first, order_by: td_client_id, priority: 3}
          - {table: site_zzz, order: first, order_by: td_client_id, priority: 4}
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_xxx, order: last, order_by: time, priority: 1}
          - {table: site_yyy, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 1}

Explanation of the Unification Algorithm

While this algorithm is commonly referred to as the Union-Find Algorithm, it will be referred to as the Unification Algorithm in this document. The Unification Algorithm represents the relationships between keys as a directed graph and transforms the graph into a structure that allows the system to identify groups of keys as representing the same individual through iterative processing.

Initial State of the Graph

The +extract_and_merge task in the workflow generates the graph_unify_loop_0 table. This represents the initial state of the graph, illustrating the "follower -> leader" relationships between keys.

canonical_id_1

graph_unify_loop_0

The initial graph represented by the graph_unify_loop_0 table is visualized using Graphviz as shown below.

canonical_id_1

How the graph_unify_loop_0 Table Is Created

Let’s examine how the graph_unify_loop_0 table, which represents the initial graph, is generated.

canonical_ids:
  - name: unified_cookie_id
    merge_by_keys: [td_client_id, td_global_id]
    merge_iterations: 5

In the merge_by_keys: section of canonical_ids:, keys used for stitching are listed in order of priority. This priority influences how the graph is constructed.

canonical_id_1

The graph_unify_loop_0 table is generated from the original data through the following steps:

canonical_id_1

  1. Among all key columns, the highest-priority key is designated as the leader.
  2. All key columns (including the leader) are expanded into rows as followers paired with their leader.
  3. The source table is similarly expanded for all tables.
  4. The graph_unify_loop_0 table is created by performing a UNION ALL on all tables and extracting unique pairs of follower_id, follower_ns, leader_id, and leader_ns.

The follower_ns and leader_ns specify the key each ID belongs to (in this example, 1 corresponds to td_client_id, and 2 to td_global_id). The priority of keys in this setup is [1, 2].

graph_unify_loop_1

Using the graph from graph_unify_loop_0, leaders are updated based on the following rules:

canonical_id_1

  • For each leader, compare its value with all other leaders connected via a single follower.
  • If a smaller value exists (in terms of string comparison), the leader is replaced by the one with the smaller value. The original leader becomes the old leader and is replaced by the new leader.
  • An "old_leader -> old_leader" relationship will always exist and must be replaced by "old_leader -> new_leader." This change signifies that the old leader is now a follower of the new leader.

The table created after all leaders are updated to their new leaders becomes the graph_unify_loop_1 table.

canonical_id_1

This process results in many leaders being replaced, with old leaders pointing to new leaders as followers. The same process is repeated in subsequent loops.

canonical_id_1

graph_unify_loop_2

canonical_id_1

graph_unify_loop_3

canonical_id_1

graph_unify_loop_4

canonical_id_1

graph_unify_loop_5 (graph)

In this example, with merge_iterations: set to 5, graph_unify_loop_5 represents the final state of the algorithm. Note that this table is output as graph rather than graph_unify_loop_5.

Determining Convergence

The Unification Algorithm is guaranteed to converge. In graph_unify_loop_4, only one leader remains, and all followers point to it. This is considered a converged state. The convergence conditions are:

  1. A single leader is consolidated for each individual.
  2. All keys point exclusively to this leader.

Once converged, a canonical ID can be assigned to all keys for the same individual. In this example, a minimum of 4 iterations (merge_iterations: 4) is necessary to achieve convergence.

Note

Setting a higher value for merge_iterations: is not problematic since processing stops upon convergence. However, note that graph_unify_loop_N tables are only created up to the point of convergence.

Impact of Changing merge_by_keys: Order

If the order in merge_by_keys: is changed to prioritize td_global_id as shown below, the graph evolution will differ completely:

canonical_ids:
  - name: unified_cookie_id
    # merge_by_keys: [td_client_id, td_global_id]
    merge_by_keys: [td_global_id, td_client_id]
    merge_iterations: 5

graph_unify_loop_0

graph_unify_loop_0

graph_unify_loop_1

graph_unify_loop_1

graph_unify_loop_2

graph_unify_loop_2

graph_unify_loop_3

graph_unify_loop_3

graph_unify_loop_4

graph_unify_loop_4

Is the order good for the least number of loops?

You might think that a good merge_by_keys: order is one that minimises the number of loops, but this is not true. (To begin with, finding the order of keys that minimises the number of loops is not realistic, as you have to run every combination of the order at least once.)

The important thing is not the number of loops, but to set the order so that the canonical_id generated for each individual is as invariant as possible (for subsequent workflow updates).

Generation and Assignment of canonical_id

The process of assigning canonical_id to the final graph after running the Unification Algorithm is executed through the +canonicalize task. For each user, a canonical_id is generated based on the leader's value, and the same ID is distributed to all its followers.

The canonical_id is derived based on the value of the leader key (graph.leader_id) using the following calculation:

to_base64url(
  to_big_endian_64(
    bitwise_xor(
      from_big_endian_64(
        substr(
          sha256(to_utf8(graph.leader_id)), 1, 8
        )
      ), leader_keys.key_mask_low64i
    )
  ) || leader_keys.key_mask_high8b
)

canonical_id_1

In the above example, canonical_id for User1 and User2 is generated based on the td_client_id that became the leader. In contrast, User3's td_client_id is absent, so the leader is set to the next prioritized value (td_global_id), and the canonical_id is generated accordingly. The generated canonical_id is also assigned to all connecting followers, ensuring that one leader and all its followers share the same canonical_id. The resulting lookup table records this information, allowing you to see the canonical_id assigned to each key.

Is canonical_id Immutable?

The canonical_id is not guaranteed to remain consistent (for the same user) across every run of the Unification Workflow.

canonical_id_1

In the next update, suppose a smaller value leader, "aaa_00", appears for User1. In this case, User1's leader becomes "aaa_00" (as the smallest value becomes the sole leader), and a different canonical_id is generated. A different canonical_id implies a different individual, indicating the emergence of a new user. For example, in Audience Studio, which uses this canonical_id, the cdp_customer_id for User1 would change.

Determining the Order of merge_by_keys

It is crucial to ensure that the canonical_id remains as consistent as possible across updates to the Unification Workflow. The order of keys in merge_by_keys: has a significant impact on the immutability of canonical_id.

The Unification Algorithm prioritizes the keys specified in merge_by_keys: in order, setting the earliest (highest priority) key as the leader whenever possible. This means the key at the top of the list is often chosen as the leader for each user.

Therefore, merge_by_keys: should be set with keys ordered from those least likely to change over time to ensure stability. For users who lack the highest-priority key, the second-priority key becomes the leader, making the order of keys critically important.

Examples of merge_by_keys: Settings

  • Good: merge_by_keys: [email, td_global_id]
    • email is less likely to change or increase for individuals, ensuring the same value is selected across updates (and thus a consistent canonical_id).
  • Good: merge_by_keys: [member_id, email, td_global_id]
    • member_id or customer_id is stable for individuals, ensuring the same value is selected across updates (and thus a consistent canonical_id).
  • Bad: merge_by_keys: [td_client_id, email, td_global_id]
    • td_client_id may change over time. If a newly generated value is smaller than the previous minimum, it becomes the leader, resulting in a different canonical_id from the previous run.

ID Unification Output

Here, we will closely examine the tables outputted in this example. This will provide more detailed insights than the explanation provided in Unification Output, including the meaning of each column in the tables.

Database for Storing Results

A database named cdp_unification_${name}, based on the name: value defined in unification_ex1.yml, is created to store the results. In this example, the database is named cdp_unification_test_id_unification_ex1.

name: test_id_unification_ex1 # Unification Name

keys:
  - name: td_client_id
  - name: td_global_id

tables:
  - database: test_id_unification_ex1
    table: ex1_site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

canonical_id_1

Tables Utilizing the Results

master_table

In this example, the table is named master_table_ex1.

master_tables:
  - name: master_table_ex1
    canonical_id: unified_cookie_id
    attributes:
      - name: td_client_id
        invalid_texts: ['']
        array_elements: 5
        source_columns:
          - {table: site_aaa, order: first, order_by: td_client_id, priority: 1}
          - {table: site_xxx, order: first, order_by: td_client_id, priority: 2}
          - {table: site_yyy, order: first, order_by: td_client_id, priority: 3}
          - {table: site_zzz, order: first, order_by: td_client_id, priority: 4}
      - name: td_global_id
        valid_regexp: "3rd_*"
        invalid_texts: ['']
        source_columns:
          - {table: site_aaa, order: last, order_by: time, priority: 1}
          - {table: site_xxx, order: last, order_by: time, priority: 1}
          - {table: site_yyy, order: last, order_by: time, priority: 1}
          - {table: site_zzz, order: last, order_by: time, priority: 1}

The table specified under name: in master_tables: is output. This is used as the master_table in the Master Segment of the Audience Studio. Below is the result from this example.

Example Result
unified_cookie_idtd_client_idtd_global_idtime
tRf8bJZ_0EC1["aaa_001", "aaa_001", "aaa_001", "aaa_001", "aaa_002"]3rd_0181691675802

A record is generated for each canonical_id. In this case, there is only one record. The time column indicates when the table was created.

canonical_id_1

Apart from the canonical_id, values specified in attributes: are stored. The td_client_id column, for example, picks up a maximum of five values based on the table order and record sequence defined in source_tables:. However, as shown in this example, duplicate values may occur, so caution is needed.

enriched_ Table

In this example, the tables are named enriched_site_aaa, enriched_site_xxx, enriched_site_yyy, and enriched_site_zzz.

tables:
  - database: test_id_unification_ex1
    table: site_aaa
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex1
    table: site_xxx
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex1
    table: site_yyy
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

  - database: test_id_unification_ex1
    table: site_zzz
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}

Tables named enriched_${source_table_name} are output (four in this example). These are used in the Master Segment as attribute_table or behavior_table.

Example Result

Below is the enriched_site_aaa table.

timesitetd_client_idtd_global_idunified_cookie_id
1672876800aaa.jpaaa_0013rd_001tRf8bJZ_0EC1
1673740800aaa.jpaaa_0013rd_002tRf8bJZ_0EC1
1674604800aaa.jpaaa_0013rd_003tRf8bJZ_0EC1
1675555200aaa.jpaaa_0013rd_004tRf8bJZ_0EC1
1676419200aaa.jp
1677283200aaa.jp
...............

canonical_id_1

This translation preserves the structure and technical accuracy of the original content while ensuring clarity in English. Let me know if you need further adjustments!

Tables to Check If Unification Was Successful

source_key_stats Table

In this example, the table is named unified_cookie_id_source_key_stats.

This table outputs statistics from the source table (e.g., the number of unique keys for each column). Only the total_distinct column is calculated from the unified_cookie_id_graph_unify_loop_0 table.

This table appends records with each execution. To retrieve the latest record, execute the following query:

SELECT *
FROM (
  SELECT *, RANK() OVER (ORDER BY time DESC) AS id
  FROM unified_cookie_id_source_key_stats
)
WHERE id = 1
ORDER BY from_table
Example Result

The id column was added by the above SQL query.

from_tabletotal_distinctdistinct_td_client_iddistinct_td_global_idtimeid
*16161816916758021
site_aaa1231016916758021
site_xxx114816916758021
site_yyy125716916758021
site_zzz134916916758021

source_key_stats

result_key_stats Table

In this example, the table is named unified_cookie_id_result_key_stats.

This table records statistics such as the number of canonical IDs (unique individuals) generated for each table after ID unification. The table appends records with each execution. To retrieve the latest record, execute the following query:

SELECT *
FROM (
  SELECT *, RANK() OVER (ORDER BY time DESC) AS id
  FROM unified_cookie_id_result_key_stats
)
WHERE id = 1
ORDER BY from_table
Example Results

The id column was added by the above SQL query.

from_tabletotal_distinctdistinct_with_td_client_iddistinct_with_td_global_idhistogram_td_client_idhistogram_td_global_idtimeid
*11116:118:116916758021
site_aaa11116:118:116916758021
site_xxx11116:118:116916758021
site_yyy11116:118:116916758021
site_zzz11116:118:116916758021

source_key_stats

Column Explanations for result_key_stats

  • from_table: Indicates the source table for the computed statistics. The "*" row represents all tables combined.
  • total_distinct: Number of unique canonical IDs in the entire table (if "*") or within each specific table.
  • distinct_with_td_client_id: Number of canonical IDs containing at least one td_client_id. Excludes IDs not stitched with td_client_id.
  • distinct_with_td_global_id: Number of canonical IDs containing at least one td_global_id. Excludes IDs not stitched with td_global_id.
  • histogram_td_client_id: Distribution of how many td_client_ids were merged into one canonical ID (e.g., 16:1 means 16 td_client_ids became one canonical ID).
  • histogram_td_global_id: Similar to histogram_td_client_id but for td_global_id.

Tables Generated in the Process of the Unification Algorithm

The following group of tables is used internally by the workflow. Some are used for tasks such as convergence judgment.

graph_unify_loop_N Table

In this example, the table is named unified_cookie_id_graph_unify_loop_0 (1, 2, 3, 4). This table represents the state of the graph in the Nth loop, frequently referenced in the algorithm's introduction. The last loop is not included here because it has a different name (graph Table).

follower_idfollower_nsleader_idleader_nsfollower_first_seen_atfollower_last_seen_atfollower_source_table_idsfollower_last_processed_attime
yyy_0021yyy_002116764192001677283200[7]16916758151691675802
3rd_0152zzz_005116849728001684972800[8, 5, 7]16916758151691675802
3rd_0152aaa_003116849728001684972800[8, 5, 7]16916758151691675802
3rd_0152yyy_004116849728001684972800[8, 5, 7]16916758151691675802
...........................
  • follower_id, leader_id Columns: These columns represent the keys for the follower and leader.
  • follower_ns, leader_ns Columns: Indicate the namespace for the follower and leader keys. The order of keys in the merge_by_keys: configuration determines the values 1, 2, etc. In this example, td_client_id is 1, and td_global_id is 2.
  • follower_first_seen_at, follower_last_seen_at Columns: Record the timestamp when the follower's key value first and last appeared (based on the time column) in the source table.
  • follower_source_table_ids Columns: An array listing the IDs of source tables containing the follower's key values. Use the tables table (unified_cookie_id_tables in this example) to determine which source table corresponds to each ID.
  • follower_last_processed_at Columns: The timestamp when the follower was last processed in the loop.

graph Table

In this example, the table is named unified_cookie_id_graph. It represents the graph's state in the final specified loop.

Tables for Mapping canonical_id to Keys

lookup Table

In this example, the table is named unified_cookie_id_lookup. This table stores all key values in a single row and allows the lookup of a key's canonical_id. It is used to create the master and enriched tables.

source_key_stats

Example Result
canonical_ididid_key_typecanonical_id_first_seen_atcanonical_id_last_seen_atid_first_seen_atid_last_seen_atid_source_table_idsid_last_processed_attime
tRf8bJZ_0EC13rd_01621672876800168765120016859232001685923200[5, 6]16916758631691675802
tRf8bJZ_0EC1aaa_00111672876800168765120016728768001675555200[5]16916758151691675802
tRf8bJZ_0EC1yyy_00311672876800168765120016806528001681516800[7]16916758511691675802
tRf8bJZ_0EC13rd_01021672876800168765120016806528001680652800[7, 5]16916758511691675802

Columns not detailed below share the same meanings as in the unified_cookie_id_graph_unify_loop_0 table.

Column Explanations for lookup

  • id_key_type Column: Indicates the key type (number) corresponding to the id. For mapping between key_type and key_name, refer to the keys table (unified_cookie_id_keys in this example).
  • id_source_table_ids Column: Lists the source tables containing the id. Use the tables table (unified_cookie_id_tables in this example) to map each table_id to a table_name.

This table maps each key to its corresponding id (1, 2, ...). It is used to identify the key_name from the id_key_type column in the lookup table.

Example Result
key_typekey_nametime
1td_client_id1691675802
2td_global_id1691675802

This table maps each source table to an assigned id. It is used to identify the table_name from the id_source_table_ids column in the lookup table.

Example Result
table_idtable_nametime
5site_aaa1691675802
6site_xxx1691675802
7site_yyy1691675802
8site_zzz1691675802