# ID Unification Tips

## Tips: Visualizing a Graph Table Using Graphviz

### Step 1: Query and Export Graph Table Data as CSV

Run the following SQL query on the `graph_unify_loop_${N}` (or `graph`) table for each iteration, and download the resulting data as a CSV file.


```sql
SELECT
  CONCAT('"',leader_id,'"') AS leader_id,
  CONCAT('"',follower_id,'"') AS follower_id,
  '' AS node_label,
  '' AS node_shape,
  '' AS edge_label
FROM ${canonical_id_name}_graph_unify_loop_${N};
```

Example output:

| leader_id | follower_id | node_label | node_shape | edge_label |
|  --- | --- | --- | --- | --- |
| """aaa_001""" | """3rd_008""" |  |  |  |
| """aaa_001""" | """3rd_013""" |  |  |  |
| """aaa_001""" | """3rd_005""" |  |  |  |
| """aaa_001""" | """xxx_004""" |  |  |  |


### Step 2: Edit CSV File

Open the CSV file in a text editor and replace all instances of `'""'` with `''`. This will clean up unnecessary quotes around the values.

Edited example:

| leader_id | follower_id | node_label | node_shape | edge_label |
|  --- | --- | --- | --- | --- |
| "aaa_001" | "3rd_008" |  |  |  |
| "aaa_001" | "3rd_013" |  |  |  |
| "aaa_001" | "3rd_005" |  |  |  |
| "aaa_001" | "xxx_004" |  |  |  |


### Step 3: Convert CSV to DOT Language Graph

Ex. Use the 3rd party service - [https://docpose.com/csv-to-dot](https://docpose.com/csv-to-dot)

## Tips: Expanding and Visualizing Histogram Columns from `result_key_stats` (Example: 100 Rows)

### Overview

In this guide, we process and visualize histogram columns such as `histogram_${key_name}` from the `result_key_stats` table. Using SQL, we transform these columns into a structured format and plot histograms and pie charts for insights.

#### Sample `result_key_stats` Table

| from_table | total_distinct | distinct_with_email | distinct_with_td_client_id | distinct_with_td_global_id | distinct_with_td_ssc_id | histogram_email | histogram_td_client_id | histogram_td_global_id | histogram_td_ssc_id | time |
|  --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| * | 1909226 | 30579 | 1909096 | 1908998 | 680373 | 1:30247,2:311,3:14,4:4,7:1,25:1,31:1 | 1:1738227,2:115270,3:26894,4:10851,5:5474,... | 1:1575567,2:166595,3:69711,4:27998,5:16761,... | 1:563326,2:96817,3:12685,4:3563,5:1579,... | 1689326177 |


### SQL Query to Transform and Normalize Histogram Data

The following SQL extracts, normalizes, and formats histogram data for plotting. It includes logic to handle missing x-values and calculate percentages for each bin.


```sql
WITH tbl_histogram AS
(
  SELECT
     CAST(SPLIT_PART(elm, ':', 1) AS BIGINT) AS x,
     CAST(SPLIT_PART(elm, ':', 2) AS BIGINT) AS num
  FROM (
    SELECT
      SPLIT(histogram_${key_name}, ',') AS hist_ary,
      ROW_NUMBER() OVER (ORDER BY time DESC) AS id
    FROM ${canonical_id_name}_result_key_stats
    WHERE from_table = '*'
  )
  CROSS JOIN UNNEST(hist_ary) AS t(elm)
  WHERE id = 1
),
tbl_serial_numbers AS
(
  SELECT val
  FROM (VALUES (SEQUENCE(1, 100, 1))) AS t(seq_ary)
  CROSS JOIN UNNEST(seq_ary) AS t(val)
)
SELECT x, num, num * 1.0 / SUM(num) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS percentage_of_total
FROM
(
  SELECT x, num
  FROM tbl_histogram
  UNION ALL
  SELECT x, 0 AS num
  FROM (SELECT val AS x FROM tbl_serial_numbers)
  WHERE x NOT IN (SELECT x FROM tbl_histogram)
)
ORDER BY x
LIMIT 100;
```

#### Key Components of the Query

1. **Percentage Calculation**
`percentage_of_total` computes the proportion of each `x` relative to the total.
2. **Histogram Bin Range**
`SEQUENCE(1, 100, 1)` ensures the x-axis ranges from 1 to 100, even if certain bins are missing.
3. **Row Limit**
`LIMIT 100` restricts the output to the first 100 bins.


### Example Visualizations

#### 1. `distinct_with_email` Histogram

##### Output Table

| x | num | percentage_of_total |
|  --- | --- | --- |
| 1 | 30337 | 0.99 |
| 2 | 313 | 0.01 |
| 3 | 14 | 0.00 |
| 4 | 4 | 0.00 |
| 5 | 0 | 0.00 |
| ... | ... | ... |


##### Observations

- Most `canonical_id`s are mapped to only one `email`.
- Small counts exist for multiple `emails` per `canonical_id`.


##### Bar Graph

![canonical_id_1](/assets/10-2-1.ed93e420d18cd94e411469061e4c76d3268db85550fdb7bdf1d07f6e8ec1f3e0.9163346c.png)

##### Pie Chart

![canonical_id_1](/assets/10-2-2.9f7d50d9c5e75fdd7ab59211c6c939012b8009f68dd5d0f3c45cfe3ad3e183c7.9163346c.png)

#### 2. `distinct_with_td_client_id` Histogram

##### Output Table

| x | num | percentage_of_total |
|  --- | --- | --- |
| 1 | 1690899 | 0.90 |
| 2 | 121851 | 0.07 |
| 3 | 28552 | 0.02 |
| 4 | 11489 | 0.01 |
| 5 | 5783 | 0.00 |
| ... | ... | ... |


##### Observations

- Majority of `canonical_id`s correspond to a single `td_client_id`.
- Potential reasons:
  - Many users interact briefly, resulting in only one `td_client_id`.
  - Incomplete stitching of `td_client_id`s across sessions.


##### Bar Graph

![canonical_id_1](/assets/10-3-1.81c48e85c07933aa458117f7bd7c4ea2f64b3d474ae12ccb7fb578ea8b53fe8d.9163346c.png)

##### Pie Chart

![canonical_id_1](/assets/10-3-2.ecb047bed07e8d15e2d0c049706c48688c4e6f96e372830b0d030cde2edbfa9f.9163346c.png)

### Insights and Considerations

1. **Stitching Accuracy**
Results reflect the accuracy of `canonical_id` stitching. High counts for `x=1` may indicate insufficient linking across identifiers.
2. **User Behavior Patterns**
Distributions help identify how users interact with your platform (e.g., single or multiple identifiers).
3. **Data Quality**
Missing or misaligned data may result in skewed histograms.


By visualizing these patterns, we gain deeper insights into user identity stitching and potential areas for data enhancement.