How to limit the number of rows per item in Hive QL

Question

Say I have multiple items listed in a WHERE clause.
How do I limit the result to N rows for each item in the list?

EX:

select a_id, b, c, count(*) as sumrequests
from table_name
where
a_id in (1,2,3)
group by a_id,b,c
limit 10000

Omkar · Answer 1 · Dec 1, 2018

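SELECT a_id, b, c, count(*) AS sumrequests
FROM (
    -- Number the rows within each a_id; `row` is a reserved word in Hive 1.2+, so it is back-quoted.
    SELECT a_id, b, c, row_number() OVER (PARTITION BY a_id) AS `row`
    FROM table_name
    WHERE a_id IN (1, 2, 3)  -- filter first so only the needed a_id values are numbered
    ) rs
WHERE rs.`row` <= 10000      -- keep at most 10,000 input rows per a_id
GROUP BY a_id, b, c;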

Try the query above. It keeps up to 10,000 rows per a_id and then aggregates them. Because the window has no ORDER BY, the rows that are kept are arbitrary rather than truly random. You can partition further if you're looking to limit by more than just a_id.
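If the goal is instead to cap the number of aggregated output rows per a_id (rather than the input rows), one possible variation is to aggregate first and number the groups afterwards. This is only a sketch: the aliases agg, ranked and rn are made up for illustration, and ordering by sumrequests DESC is an assumption about which groups should be kept.

```sql
-- Sketch: keep at most 10,000 (b, c) groups per a_id, largest counts first.
SELECT a_id, b, c, sumrequests
FROM (
    SELECT a_id, b, c, sumrequests,
           -- number the groups inside each a_id (ordering is an assumption)
           row_number() OVER (PARTITION BY a_id ORDER BY sumrequests DESC) AS rn
    FROM (
        -- aggregate first, so the limit applies to output rows
        SELECT a_id, b, c, count(*) AS sumrequests
        FROM table_name
        WHERE a_id IN (1, 2, 3)
        GROUP BY a_id, b, c
    ) agg
) ranked
WHERE rn <= 10000;
```

With an explicit ORDER BY in the window, the rows that survive the limit are well defined apart from ties.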


One question about the subquery, if you can answer:

SELECT a_id, b, c, row_number() over (Partition BY a_id) as row FROM table_name

This inner query is executed over all the rows for every request.
For example, with pagination, one user may need data starting from the 50th row and the next user from the 100th row, so the inner query runs again for each request. Can that be a performance issue?

commented Sep 10, 2019 by Debapriya

Yes @Debapriya, the inner query runs for each request, and that can cause a performance issue. In such cases you have to trade time against space. If you want to reduce execution time (for example, because the query needs to be executed frequently), one way is to store the result in a separate table and then read from that table. The cost is the extra space that table occupies.
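A rough sketch of that idea, assuming the numbered rows are materialized once and each page request then reads only a range of row numbers. The table name table_name_ranked, the column rn, the ORC storage format and the ORDER BY b, c ordering are all assumptions made for illustration.

```sql
-- One-time (or scheduled) step: store the numbered rows in a separate table.
CREATE TABLE table_name_ranked STORED AS ORC AS
SELECT a_id, b, c,
       row_number() OVER (PARTITION BY a_id ORDER BY b, c) AS rn
FROM table_name
WHERE a_id IN (1, 2, 3);

-- Per-request step: fetch one page (rows 51 to 100 for a_id = 1)
-- without recomputing the window function.
SELECT a_id, b, c
FROM table_name_ranked
WHERE a_id = 1
  AND rn BETWEEN 51 AND 100;
```

Because the numbering is stored, every request sees the same row order, but the table has to be rebuilt whenever table_name changes.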