# Files Required for ID Unification Workflow

To set up a Unification Workflow, the following two files are essential:

1. **Dig file**: Used to invoke the Unification Workflow.
2. **YAML file**: Defines the data sources and stitching keys used for ID unification.


This document provides detailed instructions on how to write these files. It includes all available options, so refer to this guide if you're uncertain about specific configurations.

## Writing a dig File

The following example serves as a template for creating a `dig` file. The file primarily sets options for running the Unification Workflow (WF). You can use this as a basic template for your implementation.


```yaml
+call_unification:
  http_call>: https://api-cdp.treasuredata.com/unifications/workflow_call
  headers:
    - authorization: ${secret:td.apikey}
  method: POST
  retry: true
  content_format: json
  content:
    run_canonical_ids: true
    run_enrichments: true
    run_master_tables: true

    full_refresh: true
    keep_debug_tables: true

    unification:
      !include : unification_ex1.yml
```

### http_call Operator

This operator triggers the Unification Workflow. The endpoint URL depends on your account's region:

| Region | URL |
|  --- | --- |
| US | `https://api-cdp.treasuredata.com/unifications/workflow_call` |
| EU01 | `https://api-cdp.eu01.treasuredata.com/unifications/workflow_call` |
| Tokyo | `https://api-cdp.treasuredata.co.jp/unifications/workflow_call` |
| Korea | `https://api-cdp.ap02.treasuredata.com/unifications/workflow_call` |


### Headers

Authentication is performed using the `td.apikey` secret, which contains the Master API Key for your account. No manual setup of the secret is required because it is auto-generated for ID Unification WF.

Warning
If you manually define a `td.apikey` secret, ensure it is formatted differently than a standard API Key. Prepend `TD1 ` (with a space) to the API Key value, e.g., `TD1 24/939***`. Improper formatting may cause authentication failures.

### Content Options

The `content` section defines execution options for the ID Unification Workflow and includes a reference to a `yml` file.

| Option Name | Required? | Default Value | Description |
|  --- | --- | --- | --- |
| `keep_debug_tables` | Optional | `false` | Set to `false` to discard intermediate debug tables. |
| `run_canonical_ids` | Optional | `true` | Set to `false` to skip the canonical ID generation process. |
| `run_enrichments` | Optional | `true` | Set to `false` to skip the enrichment process for canonical IDs. |
| `run_master_tables` | Optional | `true` | Set to `false` to skip the master table generation process. |
| `full_refresh` | Optional | `true` | If `false`, performs [incremental updates](/en/tools/idu/p7_incremental) instead of full refreshes. As described in the doc, incremental_update will run full refreshes every 3 days |


Recommendation
Keep `run_*` options set to `true` to ensure all necessary processes are executed. For `keep_debug_tables`, set it to `true`. This will help to vidualize the ID Unification statitics later. Configure `full_refresh: false` only when incremental updates are required.

## Writing a yml File

The example below demonstrates how to write the referenced `yml` file (`unification_ex1.yml`) from the `dig` file example.


```yaml
name: production
keys:
  - name: td_client_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['']

  - name: td_global_id
    valid_regexp: "[0-9a-fA-F]{8}-..."
    invalid_texts: ['', '0000000000-...']

  - name: email
    valid_regexp: ".*@.*"

tables:
  - database: prod
    table: pageviews
    incremental_columns: [updated_at, id]
    key_columns:
      - {column: td_client_id, key: td_client_id}

 - database: brand2
    table: pageviews
    as: brand2_pageviews
    key_columns:
      - {column: td_client_id, key: td_client_id}
      - {column: td_global_id, key: td_global_id}
      - {column: email, key: email}

  - database: prod
    table: contacts
    key_columns:
      - {column: email, key: email}

canonical_ids:
  - name: browser_id
    merge_by_keys: [td_client_id, td_global_id]

master_tables:
  - name: marketing_master
    canonical_id: browser_id

    attributes:
      - name: browser_id
        source_canonical_id: browser_id

      - name: email
        source_columns:
          - {table: contacts, column: email}
```

## Names

Specifies the name of this Unification. This name will be entered as the database name (`cdp_unification_${name}`) that will store the output of the ID Unification. In the above example, the `cdp_unification_production database` is created and the output is stored.

## Keys

Define the keys used for stitching IDs across all tables.

| Option Name | Required? | Description |
|  --- | --- | --- |
| `name` | Required | Key name. Use consistent naming even if column names differ across tables. |
| `valid_regexp` | Optional | Regular expression to validate key values. |
| `invalid_texts` | Optional | Array of invalid key values to exclude. Ex. `''`, `N/A` |


## Tips: Regular Expression for `td_client_id` / `td_global_id`

The values for `td_client_id` and `td_global_id` are similar to:

- `307423ab-9cbc-4bca-9cae-05c3700cc8f2`
- `fc6422e1-48bb-4396-b7aa-bdd2075c000d`


These can be uniformly expressed as the following regular expression:

- `[0-9a-fA-F]{8}-[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{4}[0-9a-fA-F]{12}`


## Tables

Configure the tables used in the unification process.

| Option Name | Required? | Description |
|  --- | --- | --- |
| `database` | Required | Specify the database name. |
| `table` | Required | Specify the table name. |
| `key_columns` | Required | Map table columns to defined keys in Array(Map). |
| `as` | Optional | Alias for the table to avoid name collisions during output. |
| `incremental_columns` | Optional | Specify columns for incremental updates (e.g., `[updated_at, id]`) in Array(String). |


### Tips: When the column and key differ in key_columns

In most cases, the values of `column` and `key` may match. However, there are instances where the identifier recorded under the column name `cookie_id` in one table needs to be stitched together with the `td_client_id` column from another table (registered as `td_client_id` in `keys`).

For such cases, it would be written as:

- `{column: cookie_id, key: td_client_id}`


### Tips: Cases where `as:` should be used

This option is necessary in situations like the current example, where there are tables with the same name, such as `pageviews`, in different databases, and both are used as source tables. When creating an enriched table in the destination database, the table names would conflict. To avoid this collision, the `as:` option is used.

## `canonical_ids`

Defines how IDs are unified and assigns canonical identifiers.

### Example:


```yaml
canonical_ids:
  - name: browser_id
    merge_by_keys: [td_client_id, td_global_id, td_ssc_id]

  - name: marketing_id
    merge_by_canonical_ids: [browser_id]
    merge_by_keys: [email]
    source_tables: [pageviews, contacts]

  - name: contact_id
    merge_by_canonical_ids: [browser_id]
    merge_by_keys: [membership_id, email]
    merge_iterations: 3
    incremental_merge_iterations: 2
```

### Options:

| Option Name | Required? | Type | Description |
|  --- | --- | --- | --- |
| `name` | Required | String | Name of the canonical ID. |
| `merge_by_keys` | Required | Array(String) | Keys used for merging. The order specifies priority. |
| `merge_iterations` | Optional | Integer | Number of iterations for the merging algorithm. Defaults to `3`. |
| `incremental_merge_iterations` | Optional | Integer | Number of iterations for incremental updates. Defaults to `2`. |
| `merge_by_canonical_ids` | Optional | Array(String) | Used to create a canonical ID by merging other canonical IDs. |


## `master_tables`

Specifies the master output tables and attributes for unified profiles.

### Example:


```yaml
master_tables:
  - name: marketing_master
    canonical_id: browser_id
    attributes:
      - name: email
        source_columns:
          - {table: contacts, column: email}
```

### Options for `canonical_ids`

| Option Name | Required? | Value | Description |
|  --- | --- | --- | --- |
| `name:` | Required | String | Sets the name of the `canonical_id`. |
| `merge_by_keys:` | Required | String | Specifies which set of keys will be used to perform stitching (Unification Algorithm). The variation of keys and the order in which they are set (priority) are extremely important. |
| `merge_iterations:` | Optional | Integer | Sets the number of loops for the Unification Algorithm, which is processed iteratively. The appropriate number of iterations depends on the problem (data), but insufficient looping can be identified from the execution results. Even if a higher number is set, the process will stop once the algorithm does. Start with 5 or 10 iterations for testing. The default value is `3`. |
| `incremental_merge_iterations:` | Optional | Integer | Sets the number of iterations used during [incremental_update](/en/tools/idu/p8_persistent_id). The default value is `2`. |
| `source_tables:` | Optional | Array(String) | Lists specific tables from those defined in `tables:` to use only the keys from these tables for unification. If not specified, all tables are referenced. |
| `merge_by_canonical_ids:` | Optional | Array(String) | Used to create a `canonical_id` by merging other `canonical_ids` as described [here](/en/tools/idu/p6_do_not_merge). |


### (Reference) Naming a `canonical_id`

When naming a `canonical_id`, it is important to make the name clear enough to indicate what the unified ID represents. Additionally, keep the name concise, as it will appear in the output table name. For example:

- **browser_id, unified_cookie_id**
  - Used when integrating identifiers such as `td_client_id`, `td_global_id`, `td_ssc_id`, and `cookie_id`.
- **person_id, customer_id, cust_id, member_id**
  - Used when integrating keys containing personal information such as `email` or `member_id`.


### Options for `master_tables`


```yaml
master_tables:
  - name: marketing_master
    canonical_id: marketing_id
    attributes:
      - name: browser_id
        source_canonical_id: browser_id

      - name: firstname
        source_columns:
          - {table: pageviews}
          - {table: form_submits, column: first_name}

      - name: birthdate
        valid_regexp: "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"
        invalid_texts: ['']
        source_columns:
          - {table: contacts, priority: 1}
          - {table: pageviews, priority: 2}
          - {table: form_submits, priority: 3}

      - name: all_emails
        array_elements: 5
        source_columns:
          - {table: contacts, order: last, order_by: time, priority: 1}
          - {table: pageviews, order: first, order_by: time, priority: 2}
          - {table: form_submits, order: first, order_by: time, priority: 2}
```

| Option Name | Required? | Value | Description |
|  --- | --- | --- | --- |
| `name:` | Required | String | Specifies the name of the `master_table` that will hold the `canonical_id` defined by `canonical_id:`. |
| `canonical_id:` | Required | String | Specifies the base `canonical_id` for the `master_table`. The number of rows in the `master_table` corresponds to the unique count of the `canonical_id`. Note that if the `merge_by_keys:` configuration for the `canonical_id` does not cover the key combinations across all source tables, an error will occur (since enrichment across all source tables will not be possible). |
| `attributes:` | Optional | (Deep Nests) | Additional columns from the source tables, other than the `canonical_id`, can be specified as `attributes:` to include them in the `master_table`. This option is explained in detail below. |


### Options for `attributes:`

| Option Name | Required? | Value | Description |
|  --- | --- | --- | --- |
| `name:` | Required | String | Specifies the attribute name, which will become the column name in the `master_table`. |
| `array_elements:` | Optional | Array(String) | If the specified column contains multiple values for a single `canonical_id` (e.g., `td_client_id` used as keys), this option allows the column to store multiple values as an array. Specifies the number of values to include (up to `10`). If omitted, the column will contain only a single value. |
| `source_columns:` | Required | (Deep Nests) | Since the specified attribute may span multiple tables and records, this option defines which table and record values to prioritize. If the column name in the table differs from the `name:` specified, use the `column:` option here to specify the column name. |
| `valid_regexp:` | Optional | Regular Expression | Same as the option with the same name in `keys:`. |
| `invalid_texts:` | Optional | Array(String) | Same as the option with the same name in `keys:`. |


### Options for `source_columns:`

| Option Name | Required? | Value | Description |
|  --- | --- | --- | --- |
| `table:` | Required | String | Name of the table to use as the attribute (database name is not required). |
| `column:` | Optional | String | If the column name in the table differs from the `name:` specified, specify the column name here. |
| `priority:` | Optional | Integer | Specifies the priority for selecting values when multiple tables are provided. Tables can share the same priority level. Within the same priority level, the `order:` option determines the order of records in the table. |
| `order:` | Optional | String | Specifies the sorting order for records based on the column defined in `order_by:` (default: "time"). Use `:first` for ascending (ASC) and `:last` for descending (DESC). Default is `:last`. |
| `order_by:` | Optional | String | Used with the `order:` option to specify the column for sorting. Default is "time". |


#### Notes on Setting the Same `priority:`

If multiple tables are assigned the same `priority:`, they must share the same sorting order. If different sorting orders are specified, an error will occur. For tables with the same priority level, the specified sorting is uniformly applied, and values are selected sequentially from top to bottom.