# ID Unification Tips ## Tips: Visualizing a Graph Table Using Graphviz ### Step 1: Query and Export Graph Table Data as CSV Run the following SQL query on the `graph_unify_loop_${N}` (or `graph`) table for each iteration, and download the resulting data as a CSV file. ```sql SELECT CONCAT('"',leader_id,'"') AS leader_id, CONCAT('"',follower_id,'"') AS follower_id, '' AS node_label, '' AS node_shape, '' AS edge_label FROM ${canonical_id_name}_graph_unify_loop_${N}; ``` Example output: | leader_id | follower_id | node_label | node_shape | edge_label | | --- | --- | --- | --- | --- | | """aaa_001""" | """3rd_008""" | | | | | """aaa_001""" | """3rd_013""" | | | | | """aaa_001""" | """3rd_005""" | | | | | """aaa_001""" | """xxx_004""" | | | | ### Step 2: Edit CSV File Open the CSV file in a text editor and replace all instances of `'""'` with `''`. This will clean up unnecessary quotes around the values. Edited example: | leader_id | follower_id | node_label | node_shape | edge_label | | --- | --- | --- | --- | --- | | "aaa_001" | "3rd_008" | | | | | "aaa_001" | "3rd_013" | | | | | "aaa_001" | "3rd_005" | | | | | "aaa_001" | "xxx_004" | | | | ### Step 3: Convert CSV to DOT Language Graph Ex. Use the 3rd party service - [https://docpose.com/csv-to-dot](https://docpose.com/csv-to-dot) ## Tips: Expanding and Visualizing Histogram Columns from `result_key_stats` (Example: 100 Rows) ### Overview In this guide, we process and visualize histogram columns such as `histogram_${key_name}` from the `result_key_stats` table. Using SQL, we transform these columns into a structured format and plot histograms and pie charts for insights. #### Sample `result_key_stats` Table | from_table | total_distinct | distinct_with_email | distinct_with_td_client_id | distinct_with_td_global_id | distinct_with_td_ssc_id | histogram_email | histogram_td_client_id | histogram_td_global_id | histogram_td_ssc_id | time | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | * | 1909226 | 30579 | 1909096 | 1908998 | 680373 | 1:30247,2:311,3:14,4:4,7:1,25:1,31:1 | 1:1738227,2:115270,3:26894,4:10851,5:5474,... | 1:1575567,2:166595,3:69711,4:27998,5:16761,... | 1:563326,2:96817,3:12685,4:3563,5:1579,... | 1689326177 | ### SQL Query to Transform and Normalize Histogram Data The following SQL extracts, normalizes, and formats histogram data for plotting. It includes logic to handle missing x-values and calculate percentages for each bin. ```sql WITH tbl_histogram AS ( SELECT CAST(SPLIT_PART(elm, ':', 1) AS BIGINT) AS x, CAST(SPLIT_PART(elm, ':', 2) AS BIGINT) AS num FROM ( SELECT SPLIT(histogram_${key_name}, ',') AS hist_ary, ROW_NUMBER() OVER (ORDER BY time DESC) AS id FROM ${canonical_id_name}_result_key_stats WHERE from_table = '*' ) CROSS JOIN UNNEST(hist_ary) AS t(elm) WHERE id = 1 ), tbl_serial_numbers AS ( SELECT val FROM (VALUES (SEQUENCE(1, 100, 1))) AS t(seq_ary) CROSS JOIN UNNEST(seq_ary) AS t(val) ) SELECT x, num, num * 1.0 / SUM(num) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS percentage_of_total FROM ( SELECT x, num FROM tbl_histogram UNION ALL SELECT x, 0 AS num FROM (SELECT val AS x FROM tbl_serial_numbers) WHERE x NOT IN (SELECT x FROM tbl_histogram) ) ORDER BY x LIMIT 100; ``` #### Key Components of the Query 1. **Percentage Calculation** `percentage_of_total` computes the proportion of each `x` relative to the total. 2. **Histogram Bin Range** `SEQUENCE(1, 100, 1)` ensures the x-axis ranges from 1 to 100, even if certain bins are missing. 3. **Row Limit** `LIMIT 100` restricts the output to the first 100 bins. ### Example Visualizations #### 1. `distinct_with_email` Histogram ##### Output Table | x | num | percentage_of_total | | --- | --- | --- | | 1 | 30337 | 0.99 | | 2 | 313 | 0.01 | | 3 | 14 | 0.00 | | 4 | 4 | 0.00 | | 5 | 0 | 0.00 | | ... | ... | ... | ##### Observations - Most `canonical_id`s are mapped to only one `email`. - Small counts exist for multiple `emails` per `canonical_id`. ##### Bar Graph ![canonical_id_1](/assets/10-2-1.ed93e420d18cd94e411469061e4c76d3268db85550fdb7bdf1d07f6e8ec1f3e0.9163346c.png) ##### Pie Chart ![canonical_id_1](/assets/10-2-2.9f7d50d9c5e75fdd7ab59211c6c939012b8009f68dd5d0f3c45cfe3ad3e183c7.9163346c.png) #### 2. `distinct_with_td_client_id` Histogram ##### Output Table | x | num | percentage_of_total | | --- | --- | --- | | 1 | 1690899 | 0.90 | | 2 | 121851 | 0.07 | | 3 | 28552 | 0.02 | | 4 | 11489 | 0.01 | | 5 | 5783 | 0.00 | | ... | ... | ... | ##### Observations - Majority of `canonical_id`s correspond to a single `td_client_id`. - Potential reasons: - Many users interact briefly, resulting in only one `td_client_id`. - Incomplete stitching of `td_client_id`s across sessions. ##### Bar Graph ![canonical_id_1](/assets/10-3-1.81c48e85c07933aa458117f7bd7c4ea2f64b3d474ae12ccb7fb578ea8b53fe8d.9163346c.png) ##### Pie Chart ![canonical_id_1](/assets/10-3-2.ecb047bed07e8d15e2d0c049706c48688c4e6f96e372830b0d030cde2edbfa9f.9163346c.png) ### Insights and Considerations 1. **Stitching Accuracy** Results reflect the accuracy of `canonical_id` stitching. High counts for `x=1` may indicate insufficient linking across identifiers. 2. **User Behavior Patterns** Distributions help identify how users interact with your platform (e.g., single or multiple identifiers). 3. **Data Quality** Missing or misaligned data may result in skewed histograms. By visualizing these patterns, we gain deeper insights into user identity stitching and potential areas for data enhancement.