# Hivemall Function Reference

Apache Hivemall is a collection of machine learning algorithms and versatile data analytics functions.

**Determining Your Version of Hivemall**
You can determine your version of Hivemall by running the following syntax from the query window within the TD Console:


```
SELECT HIVEMALL_VERSION()
```

## Approximate Functions

### APPROX_COUNT_DISTINCT

**Signature**


```
approx_count_distinct(column)
approx_distinct(column)
```

**Description**

`APPROX_COUNT_DISTINCT` and its alias `APPROX_DISTINCT` approximately compute the number of unique elements in a column. This function corresponds to Trino(Presto)’s `APPROX_DISTINCT`. See [Hivemall documentation](https://hivemall.github.io/misc/approx.html) for details.

## Natural Language and Text Processing Functions

### BASE91

**Signature**


```
base91(binary)
```

**Description**

`BASE91` converts the argument from binary to a BASE91 string.

**Example**


```
SELECT base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
> AA+=kaIM|WTt!+wbGAA
```

### UNBASE91

**Signature**


```
unbase91(string)
```

**Description**

`UNBASE91` converts a BASE91 string to a binary.

**Example**


```
SELECT inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
> aaaaaaaaaaaaaaaabbbbccc
```

### IS_STOPWORD

**Signature**


```
is_stopword(string word)
```

**Description**

`IS_STOPWORD` determines whether a word is an English stop word or not. Stop words are words that are filtered out before or after the processing of text.

### NORMALIZE_UNICODE

**Signature**


```
normalize_unicode(string str [, string form])
```

**Description**

`NORMALIZE_UNICODE` transforms str to the specified normalization form. The form takes one of NFC (default), NFD, NFKC, or NFKD.

**Example**


```
select normalize_unicode('ﾊﾝｶｸｶﾅ','NFKC');
ハンカクカナ
select normalize_unicode('㈱㌧㌦Ⅲ','NFKC');
(株)トンドルIII
```

### SINGULARIZE

**Signature**


```
singularize(string word)
```

**Description**

`SINGULARIZE` returns the singular form of a given English word. For more information, see [Hivemall User Guide](https://hivemall.github.io/misc/tokenizer.html).

### SPLIT_WORDS

**Signature**


```
split_words(string query [, string regex])
```

**Description**

`SPLIT_WORDS` returns an array that contains split strings.

### WORD_NGRAMS

**Signature**


```
word_ngrams(array<string> words, int minSize, int maxSize)
```

**Description**

`WORD_NGRAMS` returns list of n-grams where minSize <= n <= maxSize. For more information, see [Hivemall User Guide](https://hivemall.github.io/misc/generic_funcs.html#text-processing).

### TOKENIZE

**Signature**


```
tokenize(string englishText [, boolean toLowerCase])
```

**Description**

`TOKENIZE` returns the words in an array.

### TOKENIZE_JA

**Signature**


```
tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags, const array<string> userDict (or string userDictURL)])
```

**Description**

`TOKENIZE_JA` returns tokenized strings in an array. You can use a given predefined dictionary as an array or as a URL to a file uploaded to somewhere like Amazon S3. See [Hivemall User Guide](https://hivemall.github.io/misc/tokenizer.html#custom-dictionary)

If you have restricted access in your Amazon S3, you must allow access to it from Treasure Data. The TOKENIZE_JA function can be used to allow access even if your S3 environment is restricted. The static IPs used to allow access must be the same as those used for your Result Workers. If necessary, contact [TD Support](https://docs.treasuredata.com/articles/#!pd/Contacting-Treasure-Data-Support).

If your Amazon S3 does not restrict access, no configuration using the TOKENIZE_JA function is required.

When your CSV file for the custom dictionary has duplicate entries, your job fails as an
`org.apache.hadoop.hive.ql.exec.UDFArgumentException` error. It is important to resolve the duplication in advance.

**Example**


```
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"]

select tokenize_ja("関西国際空港", "normal", null, null,
                   array("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"));
["関西","国際","空港"]

-- stoptags_exclude is useful for an include rule of part-of-speech information
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
["分かち書き","テスト"]

-- using pre-defined library
select tokenize_ja("関西国際空港", "normal", null, null, "https://raw.githubusercontent.com/atilika/kuromoji/909fd6b32bf4e9dc86b7599de5c9b50ca8f004a1/kuromoji-core/src/test/resources/userdict.txt");
["関西","国際","空港"]
```

You can get Part-of-Speech (PoS) information using -pos option as follows:


```
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
```

| tokens | pos | token0 | pos0 |
|  --- | --- | --- | --- |
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |


You can get the complete list of stopTags by


```
select stoptags_exclude(array());
```

### TOKENIZE_JA_NEOLOGD

**Signature**


```
tokenize_ja_neologd(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags, const array<string> userDict (or string userDictURL)])
```

**Description**

`TOKENIZE_JA_NEOLOGD` returns tokenized strings in an array by using the NEologd dictionary.

`mecab-ipadic-NEologd` is a customized system dictionary for MeCab; it includes new words that are extracted from many different language resources on the Web. For more details, see [Hive Japanese NLP UDFs with NEologd](https://hivemall.github.io/misc/tokenizer.html)

**Example**


```
select tokenize_ja_neologd(); -- returns current UDF version with corresponding NEologd version date
["0.1.0-20180524"]

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```

### TOKENIZE_CN

**Signature**


```
tokenize_cn(string line, optional const array<string> stopWords)
```

**Description**

Simplified Chinese text tokenizer UDF uses SmartChineseAnalyzer.

**Example**


```
select tokenize_cn("Smartcn为Apache2.0协议的开源中文分词系统，Java语言编写，修改的中科院计算所ICTCLAS分词系统。");
[smartcn, 为, apach, 2, 0, 协议, 的, 开源, 中文, 分词, 系统, java, 语言, 编写, 修改, 的, 中科院, 计算, 所, ictcla, 分词, 系统]
```

### TOKENIZE_KO

**Signature**


```
tokenize_ko(
       String line [, const string mode = "discard" (or const string opts),
       const array<string> stopWords,
       const array<string>
       stopTags,
       const array<string> userDict (or const string userDictURL)]
) - returns tokenized strings in array<string>
```

**Description**

The Korean tokenizer internally uses [lucene-analyzers-nori](https://www.slideshare.net/elasticsearch/nori-the-official-elasticsearch-plugin-for-korean-language-analysis) for tokenization.

For additional usage help:


```
select tokenize_ko("", "-help");

usage: tokenize_ko(String line [, const string mode = "discard" (or const
       string opts), const array<string> stopWords, const array<string>
       stopTags, const array<string> userDict (or const string
       userDictURL)]) - returns tokenized strings in array<string> [-help]
       [-mode <arg>] [-outputUnknownUnigrams]
-helpShow function help
-mode <arg>              The tokenization mode. One of ['node', 'discard'
      (default), 'mixed']
-outputUnknownUnigrams   outputs unigrams for unknown words.
```

For the 2nd argument, instead of mode, you can use options starting with `-`. For detailed options, refer to the [Lucene API](https://hivemall.github.io/misc/tokenizer.html#korean-tokenizer) document. `none`, `discord` (default), or `mixed` are supported for the mode argument.

**Examples**


```
-- show version of lucene-analyzers-nori
select tokenize_ko();
> 8.8.2

select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!');
> ["중요","기능","개발","주","고맙"]

-- explicitly using default options
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', '-mode discard',
  -- stopwords (null to use default)
  -- see https://github.com/apache/incubator-hivemall/blob/master/nlp/src/main/resources/hivemall/nlp/tokenizer/ext/stopwords-ko.txt
  null,
  -- stoptags
  -- see https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html
  array(
   'E',   -- Verbal endings
   'IC',  -- Interjection
   'J',   -- Ending Particle
   'MAG', -- General Adverb
   'MAJ', -- Conjunctive adverb
   'MM',  -- Determiner
   'SP',  -- Space
   'SSC', -- Closing brackets
   'SSO', -- Opening brackets
   'SC',  -- Separator
   'SE',  -- Ellipsis
   'XPN', -- Prefix
   'XSA', -- Adjective Suffix
   'XSN', -- Noun Suffix
   'XSV', -- Verb Suffix
   'UNA', -- Unknown
   'NA',  -- Unknown
   'VSV'  -- Unknown
  )
);
> ["중요","기능","개발","주","고맙"]

-- None mode, without General Adverb (MAG)
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!',
  -- No decomposition for compound.
  '-mode none',
  -- stopwords (null to use default)
  null,
  array(
   'E',   -- Verbal endings
   'IC',  -- Interjection
   'J',   -- Ending Particle
   -- 'MAG', -- General Adverb
   'MAJ', -- Conjunctive adverb
   'MM',  -- Determiner
   'SP',  -- Space
   'SSC', -- Closing brackets
   'SSO', -- Opening brackets
   'SC',  -- Separator
   'SE',  -- Ellipsis
   'XPN', -- Prefix
   'XSA', -- Adjective Suffix
   'XSN', -- Noun Suffix
   'XSV', -- Verb Suffix
   'UNA', -- Unknown
   'NA',  -- Unknown
   'VSV'  -- Unknown
  )
);
> ["중요","기능","개발","줘서","정말","고마워요"]

-- discard mode: Decompose compounds and discards the original form (default).
-- https://lucene.apache.org/core/8_8_2/analyzers-nori/org/apache/lucene/analysis/ko/KoreanTokenizer.DecompoundMode.html
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', '-mode discard');
> ["중요","기능","개발","주","고맙"]

-- default stopward (null), with stoptags
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', '-mode discard', null, array('E', 'VV'));
> ["중요","하","새","기능","을","개발","하","주","정말","고맙"]

-- mixed mode: Decompose compounds and keeps the original form.
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', 'mixed');
> ["중요","기능","개발","줘서","주","고마워요","고맙"]

select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', '-mode mixed');
> ["중요","기능","개발","줘서","주","고마워요","고맙"]

-- node mode: No decomposition for compound.
select tokenize_ko('중요한 새 기능을 개발해줘서 정말 고마워요!', '-mode none');
> ["중요","기능","개발","줘서","고마워요"]

select tokenize_ko('Hello, world.', '-mode none');
> ["hello","world"]

select tokenize_ko('Hello, world.', '-mode none -outputUnknownUnigrams');
> ["h","e","l","l","o","w","o","r","l","d"]

select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
> ["나","c","언어","프로그래밍","언어","사랑"]

select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

-- default stopward (null), default stoptags (null)
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard');
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null);
> ["나","c","언어","프로그래밍","언어","사랑"]

-- no stopward (empty array), default stoptags (null)
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array());
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), null);
> ["나","c","언어","프로그래밍","언어","사랑"]

-- no stopward (empty array), no stoptags (emptry array), custom dict
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', array(), array(), array('C++'));
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

> -- default stopward (null), default stoptags (null), custom dict
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null, array('C++'));
> ["나","c++","언어","프로그래밍","언어","사랑"]
```

#### Custom Dictionary

The fifth argument`userDictURL` enables you to register a user-defined custom dictionary placed in http/https accessible external site. Learn more about the [custom dictionary format here](https://hivemall.github.io/misc/tokenizer.html#custom-dictionary).


```
`select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt');

> ["나","c++","프로그래밍","즐기"]
```

Info
The custom dictionary must be be accessible through HTTP or HTTPS.

Treasure Data recommends that it be compressed using gzip with a `.gz` suffix because the maximum dictionary size is limited to 32MB and the read timeout is set to 60 seconds.

The connection must be established in 10 seconds.

## GeoSpatial User Defined Functions

### TILE

**Signature**


```
tile(double lat, double lon, int zoom)
```

**Description**

`TILE` returns a tile number in `xtile(lon,zoom) + ytile(lat,zoom) * 2^z`. The tile number is in the  range `[0,2^2z]`. For more information, see [Hivemall User Guide](https://hivemall.github.io/geospatial/latlon.html#tile-number-function)

### HAVERSINE_DISTANCE

**Signature**


```
haversine_distance(double lat1, double lon1, double lat2, double lon2, [const boolean mile=false])
```

**Description**

`HAVERSINE_DISTANCE` returns the Haversine distance between two Geo locations.For more information, see [Hivemall User Guide](https://hivemall.github.io/geospatial/latlon.html#distance-function).

**Example**


```
-- Tokyo (lat: 35.6833, lon: 139.7667), Osaka (lat: 34.6603, lon: 135.5232)
select
  haversine_distance(35.6833, 139.7667, 34.6603, 135.5232) as km,
  haversine_distance(35.6833, 139.7667, 34.6603, 135.5232, true) as mile;
402.09212137829684    249.8484608500711
```

## Aggregate Functions

### MAJORITY_VOTE

**Signature**


```
majority_vote(Primitive x)
```

**Description**

Returns the most frequent value.

**Example**


```
WITH data as (
select
explode(array('1', '2', '2', '2', '5', '4', '1', '2')) as k
)
select
majority_vote(k) as k
from
data;
> 2
```

### MAX_BY

**Signature**


```
max_by(x, y)
```

**Description**

Returns the value of x associated with the maximum value of y over all input values.

**Example**


```
WITH data as (
select 'jake' as name, 18 as age
union all
select 'tom' as name, 64 as age
union all
select 'lisa' as name, 32 as age
)
select
max_by(name, age) as name
from
data;
> tom
```

### MIN_BY

**Signature**


```
min_by(x, y)
```

**Description**

Returns the value of x associated with the minimum value of y over all input values.

**Example**


```
WITH data as (
select 'jake' as name, 18 as age
union all
select 'tom' as name, 64 as age
union all
select 'lisa' as name, 32 as age
)
select
min_by(name, age) as name
from
data;
> jake
```

## Array Functions

### ARANGE

**Signature**


```
arange([int start=0, ] int stop, [int step=1])
```

**Description**

Return evenly spaced values within a given interval.

**Example**


```
select arange(5), arange(1, 5), arange(1, 5, 1), arange(0, 5, 1);
> [0,1,2,3,4] [1,2,3,4] [1,2,3,4] [0,1,2,3,4]

select arange(1, 6, 2);
> 1, 3, 5

select arange(-1, -6, 2);
> -1, -3, -5
```

### ARGMAX

**Signature**


```
argmax(array<T> a)
```

**Description**

Return the first index of the maximum value

**Example**


```
select argmax(array(5,2,0,1));
> 0
```

### ARGMIN

**Signature**


```
argmin(array<T> a)
```

**Description**

Return the first index of the minimum value

**Example**


```
SELECT argmin(array(5,2,0,1));
> 2
```

### ARGRANK

**Signature**


```
argrank(array<ANY> a)
```

**Description**

Return the indices that would sort an array.

**Example**


```
select argrank(array(5,2,0,1)), argsort(argsort(array(5,2,0,1)));
> [3, 2, 0, 1] [3, 2, 0, 1]
```

### ARGSORT

**Signature**


```
argsort(array<ANY> a)
```

**Description**

Return the indices that would sort an array.

**Example**


```
select argsort(array(5,2,0,1));
> 2, 3, 1, 0
```

### ARRAY_APPEND

**Signature**


```
array_append(array<T> arr, T elem)
```

**Description**

Append an element to the end of an array.

**Example**


```
SELECT array_append(array(1,2),3);
> 1,2,3

SELECT array_append(array('a','b'),'c');
> "a","b","c"
```

### ARRAY_AVG

**Signature**


```
array_avg(array<number>)
```

**Description**

Returns an array<double> where each element is the mean of a set of numbers. This is an aggregate function.

**Example**


```
WITH input as (
select array(1.0, 2.0, 3.0) as nums
UNION ALL
select array(2.0, 3.0, 4.0) as nums
)
select
array_avg(nums)
from
input

> ["1.5","2.5","3.5"]
```

### 6.8. ARRAY_CONCAT

**Signature**


```
array array_concat(array<ANY> x1, array<ANY> x2, ..)
```

**Description**

The `ARRAY_CONCAT` function returns a concatenated array.

**Example**


```
select array_concat(array(1),array(2,3))
> [1,2,3]
```

### ARRAY_FLATTEN

**Signature**


```
array_flatten(array<array<ANY>>)
```

**Description**

Returns an array with the elements flattened.

**Example**


```
SELECT array_flatten(array(array(1,2,3),array(4,5),array(6,7,8)));
> [1,2,3,4,5,6,7,8]
```

### ARRAY_INTERSECT

**Signature**


```
array_intersect(array<ANY> x1, array<ANY> x2, ..)
```

**Description**

The `ARRAY_INTERSECT` function returns an intersect of given arrays.

**Example**


```
select array_intersect(array(1,3,4),array(2,3,4),array(3,5))
> [3]
```

### ARRAY_REMOVE

**Signature**


```
array_remove(array<int|text> original, int|text|array<int> target)
```

**Description**

`ARRAY_REMOVE` returns an array where the target is removed from the original array.

**Example**


```
select array_remove(array(1,null,3),array(1));
> [null,3]

select array_remove(array("aaa","bbb"),"bbb");
> ["aaa"]
```

### ARRAY_SLICE

**Signature**


```
array_slice(array<ANY> values, int offset [, int length])
```

**Description**

Slices the given array by the given offset and length parameters.

**Example**


```
SELECT
array_slice(array(1,2,3,4,5,6),2,4),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
0, -- offset
2 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6, -- offset
3 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6, -- offset
10 -- length
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
6 -- offset
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
-3 -- offset
),
array_slice(
array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"),
-3, -- offset
2 -- length
);

> [3,4]
> ["zero","one"]
> ["six","seven","eight"]
> ["six","seven","eight","nine","ten"]
> ["six","seven","eight","nine","ten"]
> ["eight","nine","ten"]
> ["eight","nine"]
```

### ARRAY_SUM

**Signature**


```
array_sum(array<NUMBER>)
```

**Description**

`ARRAY_SUM` returns an array where each element is summed up.

**Example**


```
WITH input as (
select array(1.0, 2.0, 3.0) as nums
UNION ALL
select array(2.0, 3.0, 4.0) as nums
)
select
array_sum(nums)
from
input

> ["3.0","5.0","7.0"]
```

### ARRAY_TO_STR

**Signature**


```
array_to_str(array arr [, string sep=','])
```

**Description**

Converts an array to a string using a separator.

**Example**


```
SELECT array_to_str(array(1,2,3),'-');
> 1-2-3
```

### ARRAY_UNION

**Signature**


```
array_union(array1, array2, ...)
```

**Description**

Returns the union of a set of arrays.

**Example**


```
SELECT array_union(array(1,2),array(1,2));
> [1,2]

SELECT array_union(array(1,2),array(2,3),array(2,5));
> [1,2,3,5]
```

### CONDITIONAL_EMIT

**Signature**


```
conditional_emit(array<boolean> conditions, array<primitive> features)
```

**Description**

Emit the features of a row according to various conditions.

**Example**


```
WITH INPUT AS(
SELECT
ARRAY(TRUE,
FALSE,
TRUE) AS conditions,
ARRAY("one",
"two",
"three") AS features
UNION
ALL SELECT
ARRAY(TRUE,
TRUE,
FALSE) AS conditions,
ARRAY("four",
"five",
"six") AS features
) SELECT
conditional_emit(
conditions,
features
)
FROM
INPUT
;
```

### ELEMENT_AT

**Signature**


```
element_at(array<T> list, int pos)
```

**Description**

Returns an element at the given position

**Example**


```
SELECT element_at(array(1,2,3,4),0);
> 1

SELECT element_at(array(1,2,3,4),-2);
> 3
```

### FIRST_ELEMENT

**Description**

Returns the first element in an array.

**Example**


```
SELECT first_element(array('a','b','c'));
> a

SELECT first_element(array());
> NULL
```

### FLOAT_ARRAY

**Signature**


```
float_array(nDims)
```

**Description**

Returns an array<float> of nDims elements.

### LAST_ELEMENT

**Description**

Return the last element in an array.

**Example**


```
SELECT last_element(array('a','b','c'));
> c
```

### SELECT_K_BEST

**Signature**


```
select_k_best(array<number> array, const array<number> importance, const int k)
```

**Description**

Returns selected top-k elements as array `<double>`.

### SORT_AND_UNIQ_ARRAY

**Signature**


```
sort_and_uniq_array(array<int>)
```

**Description**

`SORT_AND_UNIQ_ARRAY` takes an array of type int and returns a sorted array in a natural order with duplicate elements eliminated.

**Example**


```
SELECT sort_and_uniq_array(array(3,1,1,-2,10));
> [-2,1,3,10]
```

### SUBARRAY

**Signature**


```
subarray(array<int> orignal, int fromIndex, int toIndex)
```

**Description**

`SUBARRAY` returns a slice of the original array between the inclusive fromIndex and the exclusive toIndex.

**Example**


```
SELECT subarray(array(1,2,3,4,5,6), 2,4)
> [3,4]
```

### SUBARRAY_ENDWITH

**Signature**


```
subarray_endwith(array<int|text> original, int|text key)
```

**Description**

`SUBARRAY_ENDWITH` returns an array that ends with the specified key

**Example**


```
SELECT subarray_endwith(array(1,2,3,4), 3);
> [1,2,3]
```

### SUBARRAY_STARTWITH

**Signature**


```
subarray_startwith(array<int|text> original, int|text key)
```

**Description**

`SUBARRAY_STARTWITH` returns an array that starts with the specified key.

**Example**


```
SELECT subarray_startwith(array(1,2,3,4), 2);
> [2,3,4]
```

### TO_STRING_ARRAY

**Signature**


```
to_string_array(array<ANY>)
```

**Description**

`TO_STRING_ARRAY` returns an array of strings.

**Example**


```
SELECT to_string_array(array(1.0,2.0,3.0));
> ["1.0","2.0","3.0"]
```

### TO_ORDERED_LIST

**Signature**


```
to_ordered_list(PRIMITIVE value [, PRIMITIVE key, const string options])
to_ordered_list(value, key [, const string options])
```

**Description**

`TO_ORDERED_LIST` returns list of values sorted by value itself or specific key. For more information, see [Hivemall user guide](https://hivemall.github.io/misc/generic_funcs.html#array).

**Example**


```
WITH t as (
     SELECT 5 as key, 'apple' as value
     UNION ALL
    SELECT 3 as key, 'banana' as value
    UNION ALL
    SELECT 4 as key, 'candy' as value
    UNION ALL
    SELECT 2 as key, 'donut' as value
    UNION ALL
    SELECT 3 as key, 'egg' as value
)
SELECT     -- expected output
    to_ordered_list(value, key, '-reverse'),       -- [apple, candy, (banana, egg | egg, banana), donut] (reverse order)
    to_ordered_list(value, key, '-k 2'),           -- [apple, candy] (top-k)
    to_ordered_list(value, key, '-k 100'),         -- [apple, candy, (banana, egg | egg, banana), dunut]
    to_ordered_list(value, key, '-k 2 -reverse'),  -- [donut, (banana | egg)] (reverse top-k = tail-k)
    to_ordered_list(value, key),                   -- [donut, (banana, egg | egg, banana), candy, apple] (natural order)
    to_ordered_list(value, key, '-k -2'),          -- [donut, (banana | egg)] (tail-k)
    to_ordered_list(value, key, '-k -100'),        -- [donut, (banana, egg | egg, banana), candy, apple]
    to_ordered_list(value, key, '-k -2 -reverse'), -- [apple, candy] (reverse tail-k = top-k)
    to_ordered_list(value, '-k 2'),                -- [egg, donut] (alphabetically)
    to_ordered_list(key, '-k -2 -reverse'),        -- [5, 4] (top-2 keys)
    to_ordered_list(key),      -- [2, 3, 3, 4, 5] (natural ordered keys)
    to_ordered_list(value, key, '-k 2 -kv_map'),   -- {4:"candy",5:"apple"}
    to_ordered_list(value, key, '-k 2 -vk_map')    -- {"candy":4,"apple":5}
FROM
    t;
```

## Bitset Functions

### BITS_COLLECT

**Signature**


```
bits_collect(int|long x)
```

**Description**

`BITS_COLLECT` returns a bit set in array. This function is an aggregate function.

### BITS_OR

**Signature**


```
bits_or(array<long> b1, array<long> b2, ..)
```

**Description**

`BITS_OR` returns a logical OR given bit sets.

**Example**


```
SELECT unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
> [1,2,3,4]
```

### TO_BITS

**Signature**


```
to_bits(int[] indexes)
```

**Description**

`TO_BITS` returns an bitset representation if the given indexes in long[].

**Example**


```
SELECT to_bits(array(1,2,3,128));
> [14,-9223372036854775808]
```

### UNBITS

**Signature**


```
unbits(long[] bitset)
```

**Description**

`UNBITS` returns a long array of the given bitset representation

**Example**


```
SELECT unbits(to_bits(array(1,4,2,3)));
> [1,2,3,4]
```

## Compression Functions

### DEFLATE

**Signature**


```
deflate(TEXT data [, const int compressionLevel])
```

**Description**

`DEFLATE` returns a compressed BINARY object by using Deflater. The compression level must be within the range [-1,9].

**Example**


```
SELECT base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
> AA+=kaIM|WTt!+wbGAA
```

### INFLATE

**Signature**


```
inflate(BINARY compressedData)
```

**Description**

`INFLATE` returns a decompressed STRING by using Inflater

**Example**


```
SELECT inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
> aaaaaaaaaaaaaaaabbbbccc
```

## Datetime Functions

### SESSIONIZE

**Signature**


```
sessionize(long timeInSec, long thresholdInSec [, String subject])
```

**Description**

`SESSIONIZE` Returns a UUID string of a session.

**Example**


```
SELECT
sessionize(time, 3600, ip_addr) as session_id,
time, ip_addr
FROM (
SELECT time, ipaddr
FROM weblog
DISTRIBUTE BY ip_addr, time SORT BY ip_addr, time DESC
) t1;
```

## JSON Functions

### TO_JSON

**Signature**


```
to_json(object)
```

**Description**

`TO_JSON` returns JSON string of given object.

**Example**


```
select to_json(ARRAY('a', 'b', 'c'));
'["a","b","c"]'
```

### FROM_JSON

**Signature**


```
from_json(STRING json, const string type)
```

**Description**

`FROM_JSON` converts a given JSON string into an object of the specified type.

**Example**


```
select from_json('["a","b","c"]', 'array<string>');
["a","b","c"]
```

## Map Functions

### MAP_EXCLUDE_KEYS

**Signature**


```
map_exclude_keys(Map<K,V> map, array<K> filteringKeys)
```

**Description**

`MAP_EXCLUDE_KEYS` returns the filtered entries of a map that excludes specified keys

**Example**


```
SELECT map_exclude_keys(map(1,'one',2,'two',3,'three'),array(2,3));
> {1:"one"}
```

### MAP_GET

**Signature**


```
map_get(MAP<K> a, K n)
```

**Description**

Return the value corresponding to the key in the map.

**Example**


```
WITH tmp as (
SELECT "one" as key
UNION ALL
SELECT "two" as key
)
SELECT map_get(map("one",1,"two",2),key)
FROM tmp;
> 1
> 2
```

### MAP_GET_SUM

**Signature**


```
map_get_sum(map<int,float> src, array<int> keys)
```

**Description**

`MAP_GET_SUM` returns sum of values that are retrieved by keys.

### 11.4. MAP_INCLUDE_KEYS

**Signature**


```
map_include_keys(Map<K,V> map, array<K> filteringKeys)
```

**Description**

`MAP_INCLUDE_KEYS` returns the filtered entries of a map having specified keys.

**Example**


```
SELECT map_include_keys(map(1,'one',2,'two',3,'three'),array(2,3));
> {2:"two",3:"three"}
```

### MAP_KEY_VALUES

**Signature**


```
array<named_struct<key,value>> map_key_values(Map<K, V> map)
```

**Description**

`MAP_KEY_VALUES` returns a array of key-value pairs in array<named_struct<key,value>>.

**Example**


```
SELECT map_key_values(map("one",1,"two",2));
> [{"key":"one","value":1},{"key":"two","value":2}]
```

### MAP_ROULETTE

**Signature**


```
map_roulette(map<key, number> [, integer seed])
```

**Description**

Return key by weighted random selection.

**Example**


```
-- returns key by weighted random selection
SELECT
map_roulette(to_map(a, b)) -- 25% Tom, 21% Zhang, 54% Wang
FROM ( -- see https://issues.apache.org/jira/browse/HIVE-17406
select 'Wang' as a, 54 as b
union all
select 'Zhang' as a, 21 as b
union all
select 'Tom' as a, 25 as b
) tmp;
> Wang

-- Weight random selection with using filling nulls with the average value
SELECT
map_roulette(map(1, 0.5, 'Wang', null)), -- 50% Wang, 50% 1
map_roulette(map(1, 0.5, 'Wang', null, 'Zhang', null)) -- 1/3 Wang, 1/3 1, 1/3 Zhang

-- NULL will be returned if every key is null
SELECT
map_roulette(map()),
map_roulette(map(null, null, null, null));
> NULL NULL

-- Return NULL if all weights are zero
SELECT
map_roulette(map(1, 0)),
map_roulette(map(1, 0, '5', 0))
> NULL NULL

-- map_roulette does not support non-numeric weights or negative weights.
SELECT map_roulette(map('Wong', 'A string', 'Zhao', 2));
> HiveException: Error evaluating map_roulette(map('Wong':'A string','Zhao':2))
SELECT map_roulette(map('Wong', 'A string', 'Zhao', 2));
> UDFArgumentException: Map value must be greather than or equals to zero: -2
```

### MAP_TAIL_N

**Signature**


```
map_tail_n(map SRC, int N)
```

**Description**

`MAP_TAIL_N` returns the last N elements from a sorted array of SRC.

### MERGE_MAPS

**Signature**


```
merge_maps(Map x)
```

**Description**

`MERGE_MAPS` returns a map that contains the union of an aggregation of maps. An existing value of a key can be replaced with the other duplicate key entry.

**Example**


```
SELECT
merge_maps(m)
FROM (
SELECT map('A',10,'B',20,'C',30)
UNION ALL
SELECT map('A',10,'B',20,'C',30)
) t;
```

### TO_MAP

**Signature**


```
to_map(key, value)
```

**Description**

`TO_MAP` converts two aggregated columns into a key-value map.

**Example**


```
WITH input as (
select 'aaa' as key, 111 as value
UNION all
select 'bbb' as key, 222 as value
)
select to_map(key, value)
from input;

> {"bbb":222,"aaa":111}
```

### TO_ORDERED_MAP

**Signature**


```
to_ordered_map(key, value [, const boolean reverseOrder=false])
```

**Description**

`TO_ORDERED_MAP` converts two aggregated columns into an ordered key-value map.

**Example**


```
with t as (
select 10 as key, 'apple' as value
union all
select 3 as key, 'banana' as value
union all
select 4 as key, 'candy' as value
)
select
to_ordered_map(key, value, true), -- {10:"apple",4:"candy",3:"banana"} (reverse)
to_ordered_map(key, value, 1), -- {10:"apple"} (top-1)
to_ordered_map(key, value, 2), -- {10:"apple",4:"candy"} (top-2)
to_ordered_map(key, value, 3), -- {10:"apple",4:"candy",3:"banana"} (top-3)
to_ordered_map(key, value, 100), -- {10:"apple",4:"candy",3:"banana"} (top-100)
to_ordered_map(key, value), -- {3:"banana",4:"candy",10:"apple"} (natural)
to_ordered_map(key, value, -1), -- {3:"banana"} (tail-1)
to_ordered_map(key, value, -2), -- {3:"banana",4:"candy"} (tail-2)
to_ordered_map(key, value, -3), -- {3:"banana",4:"candy",10:"apple"} (tail-3)
to_ordered_map(key, value, -100) -- {3:"banana",4:"candy",10:"apple"} (tail-100)
from t;
```

## MapReduce Functions

### ROWID

**Signature**


```
string rowid()
```

**Description**

`ROWID` returns a generated row id of a form {TASK_ID}–{SEQUENCE_NUMBER}

**Example**


```
SELECT rowid() as rowid, col1, col2 FROM input
```

### ROWNUM

**Signature**


```
long rownum()
```

**Description**

`ROWNUM returns` a generated row number `sprintf(%d%04d,sequence,taskId)` in a long.

**Example**


```
SELECT rownum() as rowid, col1, col2 FROM input
```

## Math Function

### INFINITY

**Signature**


```
double infinity()
```

**Description**

`INFINITY` returns the constant representing positive infinity.

### IS_FINITE

**Signature**


```
boolean is_finite(number x)
```

**Description**

`IS_FINITE` determines if x is finite.

**Example**


```
SELECT is_finite(333), is_finite(infinity());
> true false
```

### IS_INFINITE

**Signature**


```
boolean is_infinite(number x)
```

**Description**

`IS_INFINITE` determines if x is infinite.

**Example**


```
SELECT is_infinite(333), is_infinite(infinity());
> false true
```

### IS_NAN

**Signature**


```
boolean is_nan(number x)
```

**Description**

`IS_NAN` determines if x is not-a-number.

**Example**


```
SELECT is_nan(333), is_nan(nan());
> false true
```

### L2_NORM

**Signature**


```
double l2_norm(number x)
```

**Description**

L2_NORM return an L2 norm of the given input x.

**Example**


```
WITH input as (
select generate_series(1,3) as v
)
select l2_norm(v) as l2norm
from input;

> 19.621416870348583 = sqrt(1^2+2^2+3^2))
```

### NAN

**Signature**


```
double nan()
```

**Description**

`NAN` returns the constant representing not-a-number.

**Example**


```
SELECT nan(), is_nan(nan());
> NaN true
```

### SIGMOID

**Signature**


```
sigmoid(x)
```

**Description**

`SIGMOID` returns 1.0 / (1.0 + exp(-x)).

**Example**


```
WITH input as (
SELECT 3.0 as x
UNION ALL
SELECT -3.0 as x
)
select
1.0 / (1.0 + exp(-x)),
sigmoid(x)
from
input;

> 0.04742587317756678 0.04742587357759476
> 0.9525741268224334 0.9525741338729858
```

## Vector and Matrix Functions

### TRANSPOSE_AND_DOT

**Signature**


```
transpose_and_dot(array<number> X, array<number> Y)
```

**Description**

`TRANSPOSE_AND_DOT` returns dot(X.T, Y) as array<array<double>>, shape = (X.#cols, Y.#cols).

For example, the transpose of an m × n matrix A is the n × m matrix AT whose columns are the rows of A.

![image](/assets/346358.d4a7738d1a469d8d48a26286ece811e4e0a01ecc4c4fec0b159ecba1e4499eda.03d68c04.png)

**Example**


```
WITH input as (
select array(1.0, 2.0, 3.0, 4.0) as x, array(1, 2) as y
UNION ALL
select array(2.0, 3.0, 4.0, 5.0) as x, array(1, 2) as y
)
select
transpose_and_dot(x, y) as xy,
transpose_and_dot(y, x) as yx
from
input;

> [["3.0","6.0"],["5.0","10.0"],["7.0","14.0"],["9.0","18.0"]] [["3.0","5.0","7.0","9.0"],["6.0","10.0","14.0","18.0"]]
```

### VECTOR_ADD

**Signature**


```
vector_add(array<NUMBER> x, array<NUMBER> y)
```

**Description**

`VECTOR_ADD` performs a vector ADD operation.  This call appends the specified element to the end of this vector.

**Example**


```
SELECT vector_add(array(1.0,2.0,3.0), array(2, 3, 4));
> [3.0,5.0,7.0]
```

### VECTOR_DOT

**Signature**


```
vector_dot(array<NUMBER> x, array<NUMBER> y)
```

**Description**

`VECTOR_DOT` performs a vector dot product calculation.

The dot product is the sum of the products of the corresponding entries of the two sequences of numbers. Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them. These definitions are equivalent when using Cartesian coordinates.

**Example**


```
SELECT vector_dot(array(1.0,2.0,3.0),array(2.0,3.0,4.0));
> 20

SELECT vector_dot(array(1.0,2.0,3.0),2);
> [2.0,4.0,6.0]
```

## Sanity Check Functions

### ASSERT

**Signature**


```
assert(boolean condition [, string errMsg])
```

**Description**

`ASSERT` throws HiveException if condition is not met.

**Example**


```
SELECT count(1) FROM stock_price WHERE assert(price > 0.0);
SELECT count(1) FROM stock_price WHERE assert(price > 0.0, 'price MUST be more than 0.0')
```

### RAISE_ERROR

**Signature**


```
raise_error()
raise_error(string errMsg)
```

**Description**

`RAISE_ERROR` throws an error.

**Example**


```
SELECT product_id, price, raise_error('Found an invalid record') FROM xxx WHERE price < 0.0
```

## Timeseries Functions

### MOVING_AVG

**Signature**


```
moving_avg(NUMBER value, const int windowSize)
```

**Description**
`MOVING_AVG` returns moving average of a time series using a given window

**Example**


```
SELECT moving_avg(x, 3) FROM (SELECT explode(array(1.0,2.0,3.0,4.0,5.0,6.0,7.0)) as x) series;
> 1.0
> 1.5
> 2.0
> 3.0
> 4.0
> 5.0
> 6.0
```

## Other Functions

### CONVERT_LABEL

**Signature**


```
convert_label(const int|const float)
```

**Description**

`CONVERT_LABEL` converts from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1.

### EACH_TOP_K

**Signature**


```
each_top_k(int K, ANY_PRIMITIVE_TYPE group, double cmpKey, *)
```

**Description**

`EACH_TOP_K` returns top-K values (or tail-K values when k is less than 0) for each group. Group need to be sorted, assuming CLUSTER BY group.

**Example**


```
SELECT
   each_top_k(
      2, class, score,
      class, student -- optional argument(s) to be resulted in addition to rank and score
   ) as (rank, score, class, student) -- rank and score are resulted by the default
FROM (
SELECT class, score, student
FROM table
CLUSTER BY class -- Mandatory for `each_top_k`
) t
```

### GENERATE_SERIES

**Signature**


```
generate_series(const int|bigint start, const int|bigint end)
```

**Description**

`GENERATE_SERIES` generates a series of values, from start to end, similar to [ PostgreSQL’s generate_series](https://hivemall.github.io/misc/generic_funcs.html#others).

**Example**


```
SELECT generate_series(2,4);
> 2
> 3
> 4

SELECT generate_series(5,1,-2);
> 5
> 3
> 1

SELECT generate_series(4,3);
> (no return)

SELECT date_add(current_date(),value),value from (SELECT generate_series(1,3)) t;
> 2018-04-21 1
> 2018-04-22 2
> 2018-04-23 3

WITH input as (
SELECT 1 as c1, 10 as c2, 3 as step
UNION ALL
SELECT 10, 2, -3
)
SELECT generate_series(c1, c2, step) as series
FROM input;
> 1
> 4
> 7
> 10
> 10
> 7
> 4
```

### TRY_CAST

**Signature**


```
try_cast(ANY src, const string typeName)
```

**Description**

`TRY_CAST` explicitly cast a value as a type. Returns null if cast fails.

**Example**


```
SELECT try_cast(array(1.0,2.0,3.0), 'array<string>')
SELECT try_cast(map('A',10,'B',20,'C',30), 'map<string,double>')
```

### X_RANK

**Signature**


```
x_rank(KEY)
```

**Description**

`X_RANK` generates a pseudo sequence number starting from 1 for each key.

### TO_LIBSVM_FORMAT

**Signature**


```
to_libsvm_format(array<string> feautres [, double/integer target, const string options])
```

**Description**

`TO_LIBSVM_FORMAT` returns a string representation in the libsvm format.

**Example**


```
select to_libsvm_format(array(‘apple:3.4’,‘orange:2.1’));
> 6284535:3.4 8104713:2.1
select to_libsvm_format(array(‘apple:3.4’,‘orange:2.1’), ‘-features 10’);
> 3:2.1 7:3.4
select to_libsvm_format(array(‘7:3.4’,‘3:2.1’), 5.0);
> 5.0 3:2.1 7:3.4
```