I want to keep duplicates in Hive when I aggregate values with collect_set(), which normally removes them. Example input:
hash_id | num_of_cats
=====================
abcdef  | 5
abcdef  | 4
abcdef  | 3
fndflka | 1
fndflka | 2
fndflka | 3
djsb33  | 7
djsb33  | 7
djsb33  | 7
The query should return:
hash_agg | cats_aggregate
===========================
abcdef   | Array<int>(5,4,3)
fndflka  | Array<int>(1,2,3)
djsb33   | Array<int>(7,7,7)
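
For reference, here is a minimal sketch of the aggregation being described. The table name `cat_counts` is a placeholder, not taken from a real schema; the column names come from the example above. The first query shows the de-duplicating behaviour of collect_set(). The second uses collect_list(), which is available in Hive 0.13.0 and later and keeps repeated values. Note that Hive does not guarantee the order of elements inside either array.

```sql
-- Assumption: cat_counts is a hypothetical table with columns
-- hash_id (string) and num_of_cats (int), as in the example data.

-- collect_set() drops duplicates, so djsb33 comes back as [7].
SELECT hash_id AS hash_agg,
       collect_set(num_of_cats) AS cats_aggregate
FROM cat_counts
GROUP BY hash_id;

-- collect_list() (Hive 0.13.0+) keeps duplicates, so djsb33 comes back as [7,7,7].
SELECT hash_id AS hash_agg,
       collect_list(num_of_cats) AS cats_aggregate
FROM cat_counts
GROUP BY hash_id;
```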