# Accessing Hive 4 Support for Hive 4 is now available on the Treasure Data console. # Specifying Hive 4 on the TD Console After your account has been enabled for Hive 4 functionality, the drop-down menu for *Type* in the Queries window will include **Hive 2023.1**. Selecting this option will cause TD to process your query with Hive 4. ![image](/assets/hive2023.e1aa63b7cab7429cf969d8a5d3508225e233b2834c5b423f8e7c44bc7fdb1b0c.03d68c04.png) # Specifying Hive 4 using the TD Toolbelt Using TD Toolbelt, you can specify a Hive 4 query by including the command line option `-T hive --engine-version stable`. For example, a Hive 4 query would look something like this: ```shell td query -w -d sample_datasets "SELECT COUNT(1) FROM www_access" -T hive --engine-version stable ``` # Specifying Hive 4 with the Python Client Using the Python Client you can specify a Hive 4 query by specifying the `engine_version` as **stable**. For example you can schedule a job using Hive 4 with the following Python code: ```python import tdclient with tdclient.Client() as td: job = td.query( "sample_datasets", "SELECT COUNT(1) FROM www_access", type="hive", engine_version="stable", ) job.wait() for row in job.result(): print(repr(row)) ``` ```python import tdclient with tdclient.Client() as td: td.create_schedule( "sched-example", { "type": "hive", "engine_version": "stable", "query": "select count(1) from www_access", "database": "sample_datasets", "cron": "0 13 * * *", }, ) ``` # New SQL Function support TD has added support for the following SQL features in Hive 4: ## Quote(str) function Quotes a string to produce a result that can be used as a properly escaped data value in an SQL statement. The string is returned enclosed by single quotation marks and with each instance of backslash, single quote ('), ASCII NUL, or Control+Z preceded by a backslash. If the argument is NULL, the return value is the word “NULL” without enclosing single quotation marks. ``` SELECT str, quote(str) FROM test_quote; +-------------------+----------------------+ | str | _c1 | +-------------------+----------------------+ | Don't say "lazy" | 'Don\'t say "lazy"' | +-------------------+----------------------+ ``` ## JSON_READ(json,type) function JSON_READ parses the given json according to the given complex type specification. ``` WITH parsed AS ( SELECT json_read('[{"user_id": 1, "item_ids": [1, 5]}, {"user_id": 2, "item_ids": [2, 8]}]', 'array>>') AS users ) SELECT users[0].user_id AS user_id_0, users[0].item_ids AS item_ids_0, users[1].user_id AS user_id_0, users[1].item_ids AS item_ids_0, users FROM parsed; +------------+-------------+--------------+---------------+----------------------------------------------------+ | user_id_0 | item_ids_0 | user_id_0_1 | item_ids_0_1 | users | +------------+-------------+--------------+---------------+----------------------------------------------------+ | 1 | [1,5] | 2 | [2,8] | [{"user_id":1,"item_ids":[1,5]},{"user_id":2,"item_ids":[2,8]}] | +------------+-------------+--------------+---------------+----------------------------------------------------+ ``` ## Sinh, cosh, tanh functions The following hyperbolic built-in functions for MATH usage have been added: sinh, cosh, tanh. ## Tumbling_window(timestamp) This function rounds timestamp agr1 to the beginning of window interval where it belongs to. ``` SELECT tumbling_window(cast('2024-05-12 15:23:59' as timestamp), interval '2' MINUTES); +------------------------+ | _c0 | +------------------------+ | 2024-05-12 15:22:00.0 | +------------------------+ ``` ## CAST + FORMAT Introduced the FORMAT clause to CAST statements as well as to SQL:2016 datetime formats. Here is a limited list of datetime formats supported: YYYY, MM, DD, HH, MI, SS, YYY, YY, Y, RRRR, RR, DDD, HH12, HH24, SSSSS, FF[1..9], AM/A.M., PM/P.M., TZH, TZM. ``` SELECT cast(cast('2024-05-12 15:23:59' as timestamp) as string format 'yyyy-MM-dd'), cast('2024/05/12 15:23:59' as timestamp format 'yyyy/MM/dd HH24:MI:ss'); +-------------+------------------------+ | _c0 | _c1 | +-------------+------------------------+ | 2024-05-12 | 2024-05-12 15:23:59.0 | +-------------+------------------------+ ``` ## Array type functions * array_min(array(obj1, obj2,...)): Returns the minimum value in an array with elements for which order is supported, returns null if array is empty * array_max(array(obj1, obj2,...)): Returns the maximum value in an array with elements for which order is supported, returns null if array is empty * array_distinct(array(obj1, obj2,...)): Returns an array of the same type as the input argument where all duplicate values have been removed. * array_join(array, delimiter, replaceNull): Concatenate the elements of an array with a specified delimiter Example: * array_slice(array, start, length): Returns the subset or range of elements from an array (subarray) * array_except(array1, array2): Returns an array of the elements in array1 but not in array2, without duplicates. * array_intersect(array1, array2): Returns an array of the elements in the intersection of array1 and array2, without duplicates. * array_union(array1, array2): Returns an array of the elements in the union of array1 and array2 without duplicates. * array_remove(array, element): Removes all occurrences of element from array. ## percentile_cont, percentile_disc ``` SELECT percentile_cont(ws_sales_price, 0.5), percentile_disc(ws_sales_price, 0.5) FROM web_sales; ``` ## approx_distinct ## DataSketches functions If approximate results are acceptable, there is a class of specialized algorithms called streaming algorithms, or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution. Hive4 integrate with [Apache Datasketches](https://datasketches.apache.org/) library more closely * HyperLogLog * ds_hll_estimate * ds_hll_estimate_bounds * ds_hll_sketch * ds_hll_stringify * ds_hll_union * ds_hll_union_f * Theta Sketch * ds_theta_estimate * ds_theta_exclude * ds_theta_intersect * ds_theta_intersect_f * ds_theta_sketch * ds_theta_union * ds_theta_union_f * Tuple Sketch * ds_tuple_arrayofdouble_estimate * ds_tuple_arrayofdouble_estimate_bounds * ds_tuple_arrayofdouble_means * etc... ## typeof This function returns a string describing the type of the argument. ``` SELECT typeof('text'), typeof(1), typeof(CAST(NULL AS bigint)), typeof(array(1.0)); +---------+------+---------+----------------------+ | _c0 | _c1 | _c2 | _c3 | +---------+------+---------+----------------------+ | string | int | bigint | array | +---------+------+---------+----------------------+ ``` ## get_sql_schema This function helps to retrieve each column name and it's data type. ``` SELECT get_sql_schema('select ws_item_sk, max(ws_quantity) as max_quantity from web_sales group by ws_item_sk'); +---------------+-----------+ | col_name | col_type | +---------------+-----------+ | ws_item_sk | bigint | | max_quantity | int | +---------------+-----------+ ``` # Hive 4 Limitations and Differences The following limitations or differences exist between the stable version of Hive 2 (Hive 2020.1) and Hive 4. ## `ORDER BY` Sorts Nulls Last While not necessarily a limitation, for ORDER_BY, the `2023.1` default ascending sort order (ASC) is NULLS LAST, and the default descending sort order (DESC) is NULLS FIRST. ```sql ORDER BY name; null aaa aaa -> bbb bbb null ``` And, to use traditional sorting you can use the following keyword: ```sql ORDER BY name NULLS FIRST; ``` or ```sql ORDER BY name NULLS LAST; ``` ## The `mask_hash` UDF Algorithm Has Changed. The default mask has changed from MD5 to SHA256. ## Time Index Log Messages Have Changed In Hive 4, you will see a change in the number of log messages related to time index filtering in output log. These messages will look something like this: ``` ** WARNING: time index filtering is not set on suprith_test.million_partitions! ** This query could be very slow as a result. ** If you used 'unix_timestamp' please modify your query to use TD_SCHEDULED_TIME instead ** or rewrite the condition using TD_TIME_RANGE ** Please see https://docs.treasuredata.com/display/public/PD/Hive+Performance+Tuning#HivePerformanceTuning-LeveragingTime-basedPartitioning ``` ## New Property to Disable Sorting of Subqueries Hive3 added the `hive.remove.orderby.in.subquery` property to disable sorting in subqueries (see [HIVE-6348](https://issues.apache.org/jira/browse/HIVE-6348).This feature continues in Hive 4. Functions that rely on UDF sorting, such as TD_SESSIONIZE are not affected by this change. ## "True" and "False" Strings Resolve Differently The strings "true" and "false" can now be evaluated as TRUE/FALSE using the AbstractGenericUDAFResolver. However, strings such as "aaa" still return true, but functions like TD_AVGIF and TD_SUMIF will return different results given “false” as an argument. ``` SELECT TD_SUMIF(1, "true"), --> 1 TD_SUMIF(1, "false"), --> 1 TD_SUMIF(1, "aaa"), --> 1 ``` ``` SELECT TD_SUMIF(1, "true"), --> 1 TD_SUMIF(1, "false"), --> null TD_SUMIF(1, "aaa"), --> 1 ``` ## `TIMESTAMP WITH LOCAL TIME ZONE` is Supported While this feature is support, be aware that it is currently impossible to change the default time zone used during the execution of the query. Therefore, the “local” time zone used will always be GMT+0 (UTC). ## Cast NULL Literals in Common Table Expressions In some cases, common table expressions (CTEs) can fail if you use a `null` literal in a SELECT statement. For example, here is CTE that fails: ```sql WITH x AS (SELECT null AS mofu, 'POST' AS method) SELECT t1.method FROM sample_datasets.www_access t1 WHERE EXISTS (SELECT 1 FROM x t2 WHERE t1.method=t2.method) UNION ALL SELECT x.method FROM x ql.Driver: FAILED: SemanticException [Error 10305]: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field: mofu org.apache.hadoop.hive.ql.parse.SemanticException: CREATE-TABLE-AS-SELECT creates a VOID type, please use CAST to specify the type, near field: mofu ``` You should explicitly cast the `null` literal to the expected type. For example, in this CTE `null` is cast to type string: ```sql WITH x AS (SELECT CAST(null AS STRING) AS mofu, 'POST' AS method) ``` ## Stateful UDFs Can Only Be Invoked in the SELECT List Stateful UDFs such as TD_X_RANK cannot be used anywhere other than in the SELECT list. For example, specifying TD_X_RANK in the HAVING clause causes this query to fail: ```sql SELECT col1, col2 FROM table HAVING TD_X_RANK(col2) < 10 FAILED: SemanticException [Error 10084]: Stateful UDF's can only be invoked in the SELECT list org.apache.hadoop.hive.ql.parse.SemanticException: Stateful UDF's can only be invoked in the SELECT list ``` To fix this you use a window function with a QUALIFY filter. ```sql SELECT col1, col2, ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) AS rank FROM table QUALIFY rank < 10 ``` ``` SELECT col1, col2, FROM table QUALIFY ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY col2 DESC) < 10 ``` ## Aliasing `HAVING` with CTE In some cases, due to query compilation optimizations introduced in Hive 2023.1, HAVING clauses that reference aliases inside CTEs no longer work. ```sql WITH x ( SELECT col1, COUNT(*) AS num FROM table1 GROUP BY 1 HAVING NUM > 10 ) SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x) FAILED: CalciteSubquerySemanticException Encountered Select alias 'num' in having clause 'num > 10' This non standard behavior is not supported with cbo on. Turn off cbo for these queries. ``` Rewrite your CTE so the HAVING clause does not reference an alias. ```sql WITH x ( SELECT col1, COUNT(*) AS num FROM table1 GROUP BY 1 HAVING COUNT(*) > 10 ) SELECT * FROM table2 WHERE col1 IN (SELECT col1 FROM x) ``` ## Nested Parentheses in a Sub-query Don't Parse Nested parentheses within subqueries are no longer allowed in Hive 2023.1. ```sql SELECT col1, AVG(col2) FROM ( ( SELECT col1, col2 FROM table WHERE time = 252691200 ) ) sub GROUP BY col1 org.apache.hadoop.hive.ql.parse.ParseException: line 6:0 cannot recognize input near ')' 'sub' 'GROUP' in subquery source ``` Rewrite your subquery without nesting parentheses. ```sql SELECT col1, AVG(col2) FROM ( SELECT col1, col2 FROM table WHERE time = 252691200 ) sub GROUP BY col1 ``` ## YEAR UDF cannot be a numeric type 2023.1 throws a syntax error when you specify a numeric value for YEAR. The same applies to other UDFs(e.g. MONTH, DAY, HOUR, MINUTE, SECOND and WEEKOFYEAR) that handle dates. The value you specify for these UDFs must be a timestamp string. ```sql SELECT YEAR(2024) FAILED: SemanticException [Error 10016]: Line 1:12 Argument type mismatch '2024': year does not take INT type ``` To resolve this issue specify a valid timestamp string. ```sql SELECT YEAR('2024-01-01') ``` ## TD UDFs with a database name {any db name}.TD_••• succeeds with 2020.1, but fails with 2023.1. Here is an example of how TD_URL_DECODE with a database name fails in 2023.1: ```sql SELECT sample_datasets.TD_URL_DECODE(path) FROM sample_datasets.www_access FAILED: SemanticException [Error 10011]: Invalid function sample_datasets.TD_URL_DECODE ``` You can avoid the problem by removing the database name from the UDF identifier. ```sql SELECT TD_URL_DECODE(path) FROM sample_datasets.www_access ``` ## Reserved Keywords of 2023.1 Hive 2023.1 introduced the following reserved keywords. If you use any of these keywords as table names or column names, you will need to quote them with backticks. - ANY - APPLICATION - COMPACTIONID - CONNECTORa - CONNECTORS - CONVERT - DDL - DEC - FORCE - LEADING - NUMERIC - OLDER - PKFK_JOIN - PREPARE - QUALIFY - REAL - SOME - SYNC - TIMESTAMPLOCALTZ - THAN - TRAILING - UNIQUE The keywords for Hive 2023.1 are almost identical to the [Keywords for Hive 4.0.0](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords) except for `TIME` which is non-reserved in the Treasure Data environment. | Version | Non-reserved Keywords | Reserved Keywords | | --- | --- | --- | | Hive 1.2.0 | ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, VIEW, WHILE, YEAR | ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH | | Hive 2.0.0 | *removed:* HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE*added:* AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE | *added:* COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START | | Hive 2.1.0 | *added:* ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE | *added:* CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES | | Hive 2.2.0 | *removed:* MINUS*added:* CACHE, DAYS, DAYOFWEEK, DUMP, HOURS, MATCHED, MERGE, MINUTES, MONTHS, QUARTER, REPL, SECONDS, STATUS, VIEWS, WEEK, WEEKS, YEARS | *removed:* CACHE*added:* EXCEPT, EXTRACT, FLOOR, GROUPING, INTEGER, MINUS, PRECISION | | Hive 2.3.0 | *removed:* MERGE*added:* DETAIL, EXPRESSION, OPERATOR, SUMMARY, VECTORIZATION, WAIT | *added:* MERGE | | Hive 3.0.0 | *removed:* PRETTY*added:* ACTIVATE, ACTIVE, ALLOC_FRACTION, CHECK, DEFAULT, DO, ENFORCED, KILL, MANAGEMENT, MAPPING, MOVE, PATH, PLAN, PLANS, POOL, QUERY, QUERY_PARALLELISM, REOPTIMIZATION, RESOURCE, SCHEDULING_POLICY, UNMANAGED, WORKLOAD, ZONE | *removed:* PARTIALSCAN*added:* ANY, APPLICATION, DEC, NUMERIC, SYNC, TIME, TIMESTAMPLOCALTZ, UNIQUE | | Hive 3.1.0 | N/A | N/A | | Hive 4.0.0 | *added:* AST, AT, BRANCH, CBO, COST, CRON, DCPROPERTIES, DEBUG, DISABLED, DISTRIBUTED, ENABLED, EVERY, EXECUTE, EXECUTED, EXPIRE_SNAPSHOTS, IGNORE, JOINCOST, MANAGED, MANAGEDLOCATION, OPTIMIZE, REMOTE, RESPECT, RETAIN, RETENTION, SCHEDULED, SET_CURRENT_SNAPSHOT, SNAPSHOTS, SPEC, SYSTEM_TIME, SYSTEM_VERSION, TAG, TRANSACTIONAL, TRIM, TYPE, UNKNOWN, URL, WITHIN | *added:* COMPACTIONID, CONNECTOR, CONNECTORS, CONVERT, DDL, FORCE, LEADING, OLDER, PKFK_JOIN, PREPARE, QUALIFY, REAL, SOME, THAN, TRAILING | | 2023.1 | *added:* TIME | *removed:* TIME |