AWS Glue Crawler Creates Partition and File Tables

+2 votes
I have a pretty basic S3 setup that I would like to query using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...
I then set up an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I expected to get one database table, with partitions on the year, month, day, etc.
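
To make that expectation concrete, here is a minimal Python sketch of how each object key under data/ would map to partition values in a single table. The helper name `partition_values` and the column names year/month/day/hour are assumptions for illustration only; because the folders are not in key=value form, a Glue crawler would typically name the partition columns partition_0, partition_1, and so on.

```python
from typing import Dict

# Hypothetical column names for the four folder levels in the layout above.
PARTITION_COLUMNS = ("year", "month", "day", "hour")


def partition_values(key: str, prefix: str = "data/") -> Dict[str, str]:
    """Map an object key such as data/2018/01/01/02/file3.json to partition values."""
    if not key.startswith(prefix):
        raise ValueError(f"key does not start with {prefix!r}: {key}")
    parts = key[len(prefix):].split("/")
    # Expect one folder per partition column, followed by the file name.
    if len(parts) != len(PARTITION_COLUMNS) + 1:
        raise ValueError(f"unexpected folder depth in key: {key}")
    return dict(zip(PARTITION_COLUMNS, parts[:-1]))


if __name__ == "__main__":
    print(partition_values("data/2018/01/01/02/file3.json"))
```

Every file would then be a row source for the same table, and only the partition values would differ from folder to folder.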

What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. So far as I can tell, separate tables were created for each file/folder, without a single overarching one where I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning so that I don't get this huge, mostly worthless dump of data.

Thanks!

Dinesh Singh

dinesh.singh2003@gmail.com
Oct 30, 2019 in AWS by Dinesh
• 140 points
4,745 views

1 answer to this question.

+2 votes

Hi, you need to enable the option "Create a single schema for each S3 path" in the crawler's configuration options. If you have 2 tables, you need to add 2 data stores to the crawler (one for each table).
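
If you would rather set this from code than from the console, the following is a rough sketch using boto3. As far as I know, the console option corresponds to the `CombineCompatibleSchemas` table grouping policy in the crawler's `Configuration` JSON described on the crawler-configuration page linked in the question. The function names and the crawler name are placeholders of mine, not something from your setup.

```python
import json


def grouping_configuration() -> str:
    """Build the crawler Configuration JSON that asks for one schema per S3 path."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


def apply_single_schema(crawler_name: str) -> None:
    """Update an existing crawler; needs boto3 and AWS credentials with Glue access."""
    import boto3  # imported here so the JSON helper works without boto3 installed

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Configuration=grouping_configuration())


if __name__ == "__main__":
    # "my-data-crawler" is a hypothetical name; replace it with your crawler's name.
    print(grouping_configuration())
    # apply_single_schema("my-data-crawler")
```

After updating the configuration, delete the tables the crawler already created and run it again, since the setting only affects how new crawl results are grouped.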

I hope this helps!

Thanks!

answered Jan 31, 2020 by Fito
Thanks, @Fito for your contribution.

