After you get the summary information for S3 read operations (see Step #2), it makes sense to look at file types. By analyzing the object keys, you can easily summarize information about compressed files, such as .gz files.

Later I will use the Hive metadata to determine whether files with names like 00000_0 are uncompressed text or ORC files.
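As a minimal sketch of this kind of summary, the Python snippet below groups object keys by file extension and totals the count and size per type. The `(key, size_bytes)` record shape and the function names are assumptions made for illustration. Keys without an extension, such as 00000_0, land in a `(none)` bucket, since their real format cannot be derived from the name alone.

```python
from collections import Counter
import posixpath


def file_type(key: str) -> str:
    """Return the lowercased extension of an S3 object key, or '(none)'."""
    name = posixpath.basename(key)
    _, ext = posixpath.splitext(name)
    return ext.lower() if ext else "(none)"


def summarize_by_type(records):
    """Summarize (key, size_bytes) pairs as {file_type: (count, total_bytes)}.

    The record shape is a hypothetical one; adapt it to however the
    object keys and sizes are actually stored.
    """
    counts = Counter()
    sizes = Counter()
    for key, size in records:
        kind = file_type(key)
        counts[kind] += 1
        sizes[kind] += size
    return {kind: (counts[kind], sizes[kind]) for kind in counts}


# Hypothetical sample records, for illustration only
sample = [
    ("warehouse/events/part-0001.gz", 1_200),
    ("warehouse/events/part-0002.GZ", 800),
    ("warehouse/orders/00000_0", 5_000),
]
for kind, (count, total) in sorted(summarize_by_type(sample).items()):
    print(kind, count, total)
```

Only the last extension is considered, so a key ending in .csv.gz is counted as .gz.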

The first step in Amazon S3 monitoring is to check the current state of your S3 buckets and how fast they grow. You can easily get this information from the CloudWatch console, by running an AWS CLI command, or with an AWS SDK script.

Bucket Size

Here is an example of an AWS CLI command that returns the size of a bucket for every day within the --start-time and --end-time date range:
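As a hedged sketch, the Python helper below assembles an `aws cloudwatch get-metric-statistics` call for the `BucketSizeBytes` metric in the `AWS/S3` namespace with a one-day period. The bucket name, the dates, and the `StandardStorage` storage type are assumptions chosen for illustration, and the helper name is hypothetical.

```python
import shlex
from datetime import date


def bucket_size_command(bucket, start, end, storage_type="StandardStorage"):
    """Build the AWS CLI argument list for daily bucket size data points.

    bucket and storage_type are illustrative values supplied by the caller;
    start and end are datetime.date objects.
    """
    return [
        "aws", "cloudwatch", "get-metric-statistics",
        "--namespace", "AWS/S3",
        "--metric-name", "BucketSizeBytes",
        "--dimensions",
        f"Name=BucketName,Value={bucket}",
        f"Name=StorageType,Value={storage_type}",
        "--start-time", f"{start.isoformat()}T00:00:00Z",
        "--end-time", f"{end.isoformat()}T00:00:00Z",
        "--period", "86400",          # one data point per day (in seconds)
        "--statistics", "Average",
    ]


# Hypothetical bucket name and date range, for illustration only
cmd = bucket_size_command("my-example-bucket", date(2018, 1, 1), date(2018, 1, 8))
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)` on a machine where the AWS CLI is installed and credentials are configured.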

So you can have 35K+ files generated per day (and there is no sub-directory for each day), and if you are going to analyze S3 statistics over a long period of time (weeks or months), the performance of your Hive or Presto queries can be very poor.

Additionally, there is often a lifecycle rule defined to keep logs for only 1-2 days.
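To make the flat-layout problem concrete, here is a rough sketch that groups log keys by day so they could be copied into per-day locations before they expire. It is not taken from the original text: it assumes, hypothetically, that each log file name starts with a YYYY-MM-DD date, and the `dt=` naming is just a Hive-style partition convention picked for illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: the file name (last path component) starts with YYYY-MM-DD.
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def day_partition(key):
    """Return a Hive-style partition name such as 'dt=2018-03-01', or None."""
    name = key.rsplit("/", 1)[-1]
    match = _DATE_RE.match(name)
    return f"dt={match.group(0)}" if match else None


def group_by_day(keys):
    """Group object keys by day; keys without a leading date go to 'dt=unknown'."""
    groups = defaultdict(list)
    for key in keys:
        groups[day_partition(key) or "dt=unknown"].append(key)
    return dict(groups)


# Hypothetical keys, for illustration only
keys = [
    "logs/2018-03-01-10-15-00-AAAA",
    "logs/2018-03-01-11-20-00-BBBB",
    "logs/2018-03-02-00-05-00-CCCC",
]
for partition, members in sorted(group_by_day(keys).items()):
    print(partition, len(members))
```

With keys grouped this way, a query restricted to a few days only has to read the files for those days instead of scanning the whole flat prefix.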